| Student | Matricola | Email |
|---|---|---|
| Santiago Vessi | 1958879 | vessi.1958879@studenti.uniroma1.it |
| Jose Angel Mola | 2116134 | molaaudi.2116134@studenti.uniroma1.it |
| André Leibrant | 2085698 | leibrant.2085698@studenti.uniroma1.it |
import json
import os
import sqlite3
import time
from collections import Counter
from datetime import datetime
from typing import List, Tuple, Optional, Dict, Union
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import numpy as np
import pandas as pd
import plotly.express as px
import seaborn as sns
import statistics
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr, col, udf, array_remove, when, sum, count, explode
from pyspark.sql.types import ArrayType, StringType
from scipy.stats import chi2_contingency, ttest_ind
from tabulate import tabulate
from functions import clean_database, database
# Set settings of Pandas so we see all the columns of a Dataframe
pd.set_option('display.max_columns', None)
# File paths
books_file = '/Users/andre/Data Science/Semester_1/ADM/HW2/files/lighter_books.json'
authors_file = '/Users/andre/Data Science/Semester_1/ADM/HW2/files/lighter_authors.json'
list_file = '/Users/andre/Data Science/Semester_1/ADM/HW2/files/list.json'
series_file = '/Users/andre/Data Science/Semester_1/ADM/HW2/files/series.json/series.json'
# Either create a new database called tables.db or establish a connection if already exists
conn, cursor = database(books_file, authors_file, list_file, series_file)
Connection to the database tables.db was successfully established!
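The helpers database and clean_database are imported from our local functions module, so their bodies are not shown in this notebook. As a rough sketch of what database does, assuming only that it opens (or creates) the SQLite file and prepares the tables before loading the JSON files (the table definition below is a placeholder, not the real schema):

```python
import sqlite3

def database_sketch(db_path="tables.db"):
    """Minimal sketch of a helper like `database`: open (or create)
    the SQLite file and prepare the tables. The real helper in
    functions.py additionally parses the JSON files into the tables."""
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    # Placeholder table; the real schema also contains books, authors,
    # series, and worst_books_list
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS book_to_author (
            book_id   INTEGER,
            author_id INTEGER
        )
    """)
    conn.commit()
    print(f"Connection to the database {db_path} was successfully established!")
    return conn, cursor
```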
We are going to load every JSON file into a Spark Dataframe, print the first 10 rows, and visualize each schema and a summary of every column to get a first idea of our data.
# Initialize Spark session
spark = SparkSession.builder \
.appName("JSONLoader") \
.config("spark.jars", "/Users/andre/Data Science/Semester_1/ADM/HW2/sqlite-jdbc-3.43.2.2.jar") \
.getOrCreate()
# Load books file into a Spark Dataframe
sparks_df_books = spark.read.json(books_file)
# Print first 10 rows
books_head = sparks_df_books.limit(10).toPandas()
display(books_head)
| asin | author_id | author_name | authors | average_rating | description | edition_information | format | id | image_url | isbn | isbn13 | language | num_pages | original_publication_date | publication_date | publisher | rating_dist | ratings_count | series_id | series_name | series_position | shelves | text_reviews_count | title | work_id | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1077326 | J.K. Rowling | [(1077326, J.K. Rowling, ), (2927, Mary GrandP... | 4.50 | There is a door at the end of a silent corrido... | US Edition | Paperback | 2 | https://i.gr-assets.com/images/S/compressed.ph... | 0439358078 | 9780439358071 | eng | 870.0 | 2003-06-21 | 2004-09 | Scholastic Inc. | 5:1674064|4:664833|3:231195|2:41699|1:16215|to... | 2628006 | 45175 | Harry Potter | 5 | [(324191, to-read), (52675, currently-reading)... | 44716 | Harry Potter and the Order of the Phoenix (Har... | 2809203 | |
| 1 | 1077326 | J.K. Rowling | [(1077326, J.K. Rowling, ), (2927, Mary GrandP... | 4.48 | Harry Potter's life is miserable. His parents ... | Library Edition | Hardcover | 3 | https://i.gr-assets.com/images/S/compressed.ph... | eng | 309.0 | 1997-06-26 | 2003-11-01 | Scholastic Inc | 5:4801606|4:1681521|3:623286|2:145898|1:125040... | 7377351 | 45175 | Harry Potter | 1 | [(63540, fantasy), (57260, favorites), (20889,... | 116930 | Harry Potter and the Sorcerer's Stone (Harry P... | 4640799 | |||
| 2 | 1077326 | J.K. Rowling | None | 4.43 | The Dursleys were so mean and hideous that sum... | Hardcover | 4 | https://i.gr-assets.com/images/S/compressed.ph... | 0439554896 | 9780439554893 | eng | 352.0 | 1998-07-02 | 2003-11-01 | Scholastic | 5:1690166|4:781011|3:313727|2:54687|1:15453|to... | 2855044 | 45175 | Harry Potter | 2 | [(282341, to-read), (67110, currently-reading)... | 55286 | Harry Potter and the Chamber of Secrets (Harry... | 6231171 | ||
| 3 | 1077326 | J.K. Rowling | [(1077326, J.K. Rowling, ), (2927, Mary GrandP... | 4.57 | For twelve long years, the dread fortress of A... | Mass Market Paperback | 5 | https://i.gr-assets.com/images/S/compressed.ph... | 043965548X | 9780439655484 | eng | 435.0 | 1999-07-08 | 2004-05-01 | Scholastic Inc. | 5:1994597|4:696545|3:212678|2:28915|1:13959|to... | 2946694 | 45175 | Harry Potter | 3 | [(292815, to-read), (52580, fantasy), (49479, ... | 58023 | Harry Potter and the Prisoner of Azkaban (Harr... | 2402163 | ||
| 4 | 1077326 | J.K. Rowling | [(1077326, J.K. Rowling, ), (2927, Mary GrandP... | 4.56 | Harry Potter is midway through his training as... | First Scholastic Trade Paperback Edition | Paperback | 6 | https://i.gr-assets.com/images/S/compressed.ph... | eng | 734.0 | 2000-07-08 | 2002-09-28 | Scholastic | 5:1808039|4:663849|3:193604|2:27759|1:12425|to... | 2705676 | 45175 | Harry Potter | 4 | [(287086, to-read), (60111, currently-reading)... | 48637 | Harry Potter and the Goblet of Fire (Harry Pot... | 3046572 | |||
| 5 | 1077326 | J.K. Rowling | [(1077326, J.K. Rowling, ), (2927, Mary GrandP... | 4.73 | <div>Six years of magic, adventure, and myster... | Box Set | Paperback | 7 | https://i.gr-assets.com/images/S/compressed.ph... | 0439887453 | 9780439887458 | eng | NaN | 2005-01-01 | 2006-09-01 | Scholastic | 5:25063|4:4467|3:1103|2:227|1:282|total:31142 | 31142 | 45175 | Harry Potter | 1-6 | [(5809, to-read), (370, favorites), (286, fant... | 975 | The Harry Potter Collection (Harry Potter, #1-6) | 21457570 | |
| 6 | 1077326 | J.K. Rowling | [(1077326, J.K. Rowling, ), (2927, Mary GrandP... | 4.79 | Box Set containing Harry Potter and the Sorcer... | Paperback | 8 | https://s.gr-assets.com/assets/nophoto/book/11... | 0439682584 | 9780439682589 | eng | 2690.0 | 2003-10-01 | 2004-09-13 | Scholastic | 5:49438|4:6112|3:1525|2:354|1:534|total:57963 | 57963 | [(7030, to-read), (146, favorites), (125, curr... | 183 | Harry Potter Boxed Set, Books 1-5 (Harry Potte... | 21457576 | |||||
| 7 | 1077326 | J.K. Rowling | None | 4.73 | Six years of magic, adventure, and mystery mak... | Box Set | Hardcover | 10 | https://i.gr-assets.com/images/S/compressed.ph... | 0439827604 | 9780439827607 | eng | 3342.0 | 2005-01-01 | 2005-09-12 | Scholastic | 5:25063|4:4467|3:1103|2:227|1:282|total:31142 | 31142 | 45175 | Harry Potter | 1-6 | [(5809, to-read), (370, favorites), (286, fant... | 975 | Harry Potter Collection (Harry Potter, #1-6) | 21457570 | |
| 8 | 4 | Douglas Adams | None | 4.22 | Seconds before the Earth is demolished to make... | Movie Tie-In Edition | Mass Market Paperback | 11 | https://i.gr-assets.com/images/S/compressed.ph... | eng | 216.0 | 1979-10-12 | 2005 | Del Rey Books | 5:753962|4:436665|3:206876|2:60681|1:32955|tot... | 1491139 | 40957 | The Hitchhiker's Guide to the Galaxy | 1 | [(716080, to-read), (26610, currently-reading)... | 31417 | The Hitchhiker's Guide to the Galaxy (Hitchhik... | 3078186 | |||
| 9 | 4 | Douglas Adams | None | 4.36 | At last in paperback in one complete volume, h... | Leather Bound | 12 | https://i.gr-assets.com/images/S/compressed.ph... | 0517226952 | 9780517226957 | eng | 815.0 | 1996-01-17 | 2005-11-01 | Gramercy Books | 5:167705|4:81013|3:32283|2:8665|1:4360|total:2... | 294026 | 40957 | The Hitchhiker's Guide to the Galaxy | 0.5-5 | [(128191, to-read), (15344, currently-reading)... | 5429 | The Ultimate Hitchhiker's Guide: Five Complete... | 135328 |
# Show schema of the file
sparks_df_books.printSchema()
root
 |-- asin: string (nullable = true)
 |-- author_id: long (nullable = true)
 |-- author_name: string (nullable = true)
 |-- authors: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- role: string (nullable = true)
 |-- average_rating: double (nullable = true)
 |-- description: string (nullable = true)
 |-- edition_information: string (nullable = true)
 |-- format: string (nullable = true)
 |-- id: long (nullable = true)
 |-- image_url: string (nullable = true)
 |-- isbn: string (nullable = true)
 |-- isbn13: string (nullable = true)
 |-- language: string (nullable = true)
 |-- num_pages: long (nullable = true)
 |-- original_publication_date: string (nullable = true)
 |-- publication_date: string (nullable = true)
 |-- publisher: string (nullable = true)
 |-- rating_dist: string (nullable = true)
 |-- ratings_count: long (nullable = true)
 |-- series_id: string (nullable = true)
 |-- series_name: string (nullable = true)
 |-- series_position: string (nullable = true)
 |-- shelves: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- count: long (nullable = true)
 |    |    |-- name: string (nullable = true)
 |-- text_reviews_count: long (nullable = true)
 |-- title: string (nullable = true)
 |-- work_id: long (nullable = true)
# Show file summary for every column except those of type array
summary_stats = sparks_df_books.describe().toPandas()
display(summary_stats)
| summary | asin | author_id | author_name | average_rating | description | edition_information | format | id | image_url | isbn | isbn13 | language | num_pages | original_publication_date | publication_date | publisher | rating_dist | ratings_count | series_id | series_name | series_position | text_reviews_count | title | work_id | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | count | 7027431 | 7027431 | 7027431 | 7027431 | 7027431 | 7027431 | 7027431 | 7027431 | 7027431 | 7027431 | 7027431 | 7027431 | 3955596 | 7027431 | 7027431 | 7027431 | 7027431 | 7027431 | 7027431 | 7027431 | 7027431 | 7027431 | 7027431 | 7027431 |
| 1 | mean | Infinity | 1977792.752914116 | 19.0 | 3.306187915612408 | 8.043967999713328E11 | 4.2425113324187964E11 | 189.16666666666666 | 1.3267196411323996E7 | None | Infinity | 9.554914350702076E12 | None | 3049.8654913191335 | 1908.9786988621665 | 1993.1070724310548 | Infinity | None | 19458.609666035853 | 99101.0692560757 | Infinity | 9.504727572188992 | 641.7257807298286 | NaN | 1.3985330931168588E7 |
| 2 | stddev | NaN | 3210660.4887492205 | 0.0 | 1.4357340879502622 | 2.711230852241999E12 | 1.8024371813681375E13 | 187.87380516364345 | 7637921.804473934 | None | NaN | 1.2504441341739316E12 | None | 2327716.909103579 | 304.3699345679781 | 347.81366636795593 | NaN | None | 160900.7621571502 | 64722.70308914834 | NaN | 795.3536595467195 | 4169.806199983704 | NaN | 1.5356372000695132E7 |
| 3 | min | 4 | "Big" John McCarthy | 0.0 | 2 | https://i.gr-assets.com/images/S/compressed.ph... | 0 | 5:-1|4:1|3:1|2:0|1:0|total:1 | 0 | 0 | 62 | ||||||||||||||
| 4 | max | حورات | 21248016 | Toshl | 5.0 | ����������Theodore Roosevelt's account of his ... | 1 | - | 25521017 | https://s.gr-assets.com/assets/nophoto/book/11... | ۹۶۴۳۱ | 9 | zun | 2147483647 | 9999-01-01 | 9999-01-01 | PHP研究所 (PHP新書) | 5:9|4:9|3:9|2:8|1:3|total:38 | 7400639 | 99999 | YASHA―夜叉―(小学館文庫全6巻) | 1 | 174524 | ��قديشة | 89721607 |
# Load authors file into a Spark Dataframe
sparks_df_authors = spark.read.json(authors_file)
# Print first 10 rows
authors_head = sparks_df_authors.limit(10).toPandas()
display(authors_head)
| about | average_rating | book_ids | fans_count | gender | id | image_url | name | ratings_count | text_reviews_count | work_ids | works_count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Douglas Noël Adams was an English author, comi... | 4.19 | [386162, 13, 8695, 8694, 6091075, 365, 569429,... | 19826 | male | 4 | https://images.gr-assets.com/authors/159137433... | Douglas Adams | 2862064 | 62681 | [3078186, 135328, 1877624, 74123, 3078120, 104... | 106 |
| 1 | William McGuire "Bill" Bryson, OBE, FRS was bo... | 4.02 | [9791, 21, 28, 24, 7507825, 27, 10538, 25, 26,... | 16144 | male | 7 | https://images.gr-assets.com/authors/157859752... | Bill Bryson | 1417316 | 84176 | [613469, 2305997, 940892, 2611786, 7800569, 31... | 75 |
| 2 | Jude Fisher is the pseudonym for <a href="http... | 4.53 | [349254, 15222, 14833682, 15221, 18126815, 152... | 60 | female | 10 | https://images.gr-assets.com/authors/141145711... | Jude Fisher | 56159 | 352 | [17150, 808427, 20487307, 90550, 25460625, 171... | 14 |
| 3 | James Hamilton-Paterson's work has been transl... | 3.79 | [40, 9416484, 12482, 3753106, 26889789, 104764... | 72 | male | 12 | https://images.gr-assets.com/authors/127051738... | James Hamilton-Paterson | 3302 | 480 | [4417, 14300808, 14780, 3796968, 44703121, 103... | 45 |
| 4 | Mark Andrew Watson (born 13 February 1980) is ... | 3.60 | [8466327, 15739968, 22756778, 51026133, 260451... | 179 | male | 14 | https://images.gr-assets.com/authors/133175379... | Mark Watson | 7979 | 772 | [13330815, 19109351, 42306244, 72694240, 26291... | 61 |
| 5 | Edith Newbold Jones was born into such wealth ... | 3.78 | [53835, 5246, 17728, 212525, 26950, 269528, 85... | 3179 | female | 16 | https://images.gr-assets.com/authors/148451223... | Edith Wharton | 436635 | 26193 | [1959512, 132919, 1652564, 4708346, 2536975, 3... | 1001 |
| 6 | <a target="_blank" rel="noopener nofollow" hre... | 4.59 | [49, 20035, 1070349, 20029, 1070347, 20031, 20... | 10 | male | 17 | https://images.gr-assets.com/authors/118568379... | Luther Butler | 46 | 21 | [4420, 21232, 1057034, 21226, 1057032, 21228, ... | 36 |
| 7 | Although he was never a dedicated student, Pau... | 3.79 | [50, 141372, 2915, 141371, 2913, 6795756, 6694... | 2636 | male | 18 | https://images.gr-assets.com/authors/159591484... | Gary Paulsen | 503435 | 34364 | [1158125, 241811, 6815, 3182664, 6813, 7000800... | 226 |
| 8 | Dale Peck (born 1967 on Long Island, New York)... | 3.66 | [6217775, 19979, 19983, 8325686, 195506, 19551... | 99 | 20 | https://images.gr-assets.com/authors/124562693... | Dale Peck | 6392 | 631 | [6398423, 21178, 1931660, 13273040, 189082, 24... | 26 | |
| 9 | Angela Knight is an American author of mostly ... | 4.00 | [1708127, 7044446, 519655, 19909, 19910, 19916... | 963 | 23 | https://images.gr-assets.com/authors/122559957... | Angela Knight | 100211 | 3650 | [1705297, 7294256, 507563, 21108, 21109, 21115... | 99 |
# Show schema of the file
sparks_df_authors.printSchema()
root
 |-- about: string (nullable = true)
 |-- average_rating: double (nullable = true)
 |-- book_ids: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- fans_count: long (nullable = true)
 |-- gender: string (nullable = true)
 |-- id: long (nullable = true)
 |-- image_url: string (nullable = true)
 |-- name: string (nullable = true)
 |-- ratings_count: long (nullable = true)
 |-- text_reviews_count: long (nullable = true)
 |-- work_ids: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- works_count: long (nullable = true)
# Show file summary for every column except those of type array
summary_stats = sparks_df_authors.describe().toPandas()
display(summary_stats)
| summary | about | average_rating | fans_count | gender | id | image_url | name | ratings_count | text_reviews_count | works_count | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | count | 351767 | 351767 | 351767 | 351767 | 351767 | 351767 | 351767 | 351767 | 351767 | 351767 |
| 1 | mean | 1752.326923076923 | 3.651194313281242 | 111.61573143586521 | None | 7751861.1911975825 | None | Infinity | 4770.586308550831 | 330.9932426862099 | 25.937137366495435 |
| 2 | stddev | 437.4607532493448 | 1.2114816423862007 | 2661.018138865955 | None | 6578409.001586667 | None | NaN | 97693.95378788888 | 3857.5391911629276 | 3066.0828516236857 |
| 3 | min | \t\t\t\t\t\t\t\t(2012)<br /><br /><br /> HARRI... | -31.0 | -33 | 4 | https://images.gr-assets.com/authors/116872917... | "Big" John McCarthy | -41 | 0 | 0 | |
| 4 | max | 🥀 Sign up for Kailee’s newsletter → <a target=... | 5.0 | 766035 | wizardess | 21248016 | https://s.gr-assets.com/images/nophoto/user/c_... | Toshl | 27003752 | 608956 | 1775176 |
# Load list file into a Spark Dataframe
sparks_df_list = spark.read.json(list_file)
# Print first 10 rows
list_head = sparks_df_list.limit(10).toPandas()
display(list_head)
| books | created_by | created_date | description | description_html | id | num_books | num_comments | num_likes | num_pages | num_voters | tags | title | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [(Stephenie Meyer, 941441, 41865, (1, 527144, ... | (73, Michael Economy) | May 20th, 2008 | What do you think are the worst books ever wri... | \n What do you think are the worst books ... | 2 | 7395 | 2570 | 175 | 74 | 18260 | [abominable, abomination, awful, bad, disgusti... | The Worst Books of All Time |
| 1 | [(Orson Scott Card, 589, 375802, (1, 572155, 5... | (, deleted user) | May 29th, 2008 | Anything in the science fiction or fantasy gen... | \n Anything in the science fiction or fan... | 3 | 7023 | 307 | 2989 | 71 | 21747 | [best-fantasy, best-science-fiction, earliest-... | Best Science Fiction & Fantasy Books |
| 2 | [(Suzanne Collins, 153394, 2767052, (1, 205279... | (, deleted user) | May 29th, 2008 | The best books first published during 2008.See... | \n The best books first published during ... | 4 | 1678 | 181 | 59 | 17 | 6939 | [2008, best, by-year, earliest-list, fiction, ... | Best Books of 2008 |
| 3 | [(J.K. Rowling, 1077326, 136251, (1, 723217, 7... | (, deleted user) | May 29th, 2008 | The best books published during the 2000s deca... | \n The best books published during the 20... | 5 | 6919 | 163 | 1192 | 70 | 28249 | [2000, 2001, 2002, 2003, 2004, 2005, 2006, 200... | Best Books of the Decade: 2000s |
| 4 | [(Harper Lee, 1825, 2657, (1, 965468, 9773), T... | (, deleted user) | May 29th, 2008 | The best books published during the 20th centu... | \n The best books published during the 20... | 6 | 7600 | 510 | 6083 | 76 | 49437 | [20th, 20th-century, best, by-century, by-year... | Best Books of the 20th Century |
| 5 | [(J.K. Rowling, 1077326, 136251, (1, 398786, 4... | (, deleted user) | May 29th, 2008 | The best books published during the 21st centu... | \n The best books published during the 21... | 7 | 9032 | 280 | 2024 | 91 | 21393 | [2001, 2002, 2003, 2004, 2005, 2006, 2007, 200... | Best Books of the 21st Century |
| 6 | [(J.K. Rowling, 1077326, 6, (1, 38157, 390), H... | (73, Michael Economy) | June 4th, 2008 | Original novel, not omnibus editions.500 pages... | \n Original novel, not omnibus editions.<... | 8 | 862 | 104 | 118 | 9 | 2602 | [500-pages, 500-pages-plus, big-books, earlies... | Thickest Books Ever |
| 7 | [(Margaret Atwood, 3472, 38447, (1, 40683, 416... | (, deleted user) | June 6th, 2008 | The best books published during the 1980s deca... | \n The best books published during the 19... | 9 | 2059 | 41 | 240 | 21 | 2323 | [1980, 1980s, 1981, 1982, 1983, 1984, 1985, 19... | Best Books of the Decade: 1980s |
| 8 | [(Barbara Kingsolver, 3541, 7244, (1, 10041, 1... | (, deleted user) | June 9th, 2008 | Best books to read on an African Safari. Books... | \n Best books to read on an African Safar... | 10 | 445 | 9 | 103 | 5 | 705 | [africa, african, earliest-list, location, saf... | Best books for an African Safari |
| 9 | [(Stieg Larsson, 706255, 2429135, (1, 297878, ... | (, deleted user) | June 10th, 2008 | The best of crime and mystery books.If the boo... | \n The best of crime and mystery books.<b... | 11 | 6112 | 138 | 1827 | 62 | 14241 | [best, crime, crime-mystery, mystery] | Best Crime & Mystery Books |
# Show schema of the file
sparks_df_list.printSchema()
root
 |-- books: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- author: string (nullable = true)
 |    |    |-- author_id: string (nullable = true)
 |    |    |-- book_id: string (nullable = true)
 |    |    |-- position: struct (nullable = true)
 |    |    |    |-- ranking: long (nullable = true)
 |    |    |    |-- score: long (nullable = true)
 |    |    |    |-- votes: long (nullable = true)
 |    |    |-- title: string (nullable = true)
 |-- created_by: struct (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- name: string (nullable = true)
 |-- created_date: string (nullable = true)
 |-- description: string (nullable = true)
 |-- description_html: string (nullable = true)
 |-- id: string (nullable = true)
 |-- num_books: long (nullable = true)
 |-- num_comments: long (nullable = true)
 |-- num_likes: long (nullable = true)
 |-- num_pages: long (nullable = true)
 |-- num_voters: long (nullable = true)
 |-- tags: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- title: string (nullable = true)
# Show file summary for every column except those of type array
summary_stats = sparks_df_list.describe().toPandas()
display(summary_stats)
| summary | created_date | description | description_html | id | num_books | num_comments | num_likes | num_pages | num_voters | title | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | count | 81511 | 81512 | 81512 | 81512 | 81512 | 81512 | 81512 | 81512 | 81512 | 81512 |
| 1 | mean | None | 503.0 | 503.0 | 82154.03516046717 | 89.25469869467072 | 1.515641868681912 | 15.345078025321426 | 1.5542742172931594 | 74.43721169889096 | 1403.25 |
| 2 | stddev | None | 845.7765662395714 | 845.7765662395714 | 47494.248649367466 | 303.8132122325048 | 14.968021371678228 | 97.12905784545934 | 3.004615235256749 | 719.0422150108883 | 875.7155795780897 |
| 3 | min | April 10th, 2009 | \n \tThe residents of Rolling Hills, a ha... | 10 | 0 | 0 | 0 | 1 | 0 | !¡War-All-The-Time¡! | |
| 4 | max | September 9th, 2020 | 🦃 Books that will be published in November 201... | \n<u>Update</u><br>I've added a dozen more boo... | 99999 | 29066 | 2570 | 16830 | 291 | 99382 | 전략적 도서 |
# Load series file into a Spark Dataframe
sparks_df_series = spark.read.json(series_file)
# Print first 10 rows
series_head = sparks_df_series.limit(10).toPandas()
display(series_head)
| description | id | note | numbered | primary_work_count | series_works_count | title | works | |
|---|---|---|---|---|---|---|---|---|
| 0 | Also known as:\n * Drakenrijders van Pern\n... | 40323 | None | true | 3 | 4 | Dragonriders of Pern | [(117, 61975, Dragonflight (Dragonriders of Pe... |
| 1 | These books continue the larger <a href=http:/... | 40325 | None | true | 4 | 16 | Heritage of Shannara | [(68, 34988, The Scions of Shannara (Heritage ... |
| 2 | None | 40326 | None | true | 3 | 6 | Dragonlance: Legends | [(53, 29187, Time of the Twins (Dragonlance: L... |
| 3 | The War of Souls is a war set in the popular D... | 40328 | None | true | 4 | 4 | Dragonlance: The War of Souls | [(31, 259858, Dragons of a Fallen Sun (Dragonl... |
| 4 | None | 40330 | None | true | 3 | 3 | Dragonlance: The Lost Chronicles | [(33, 92909, Dragons of the Dwarven Depths (Dr... |
| 5 | None | 40331 | None | true | 2 | 2 | Starshield | [(11, 28510, The Mantle of Kendis-Dai (Starshi... |
| 6 | A kids' and young adult, science-fiction serie... | 40332 | This order is listed in publication order, NOT... | true | 4 | 7 | Book of Ember | [(73, 307791, The City of Ember (Book of Ember... |
| 7 | Patrick O'Brian's beloved series of 19th-centu... | 40333 | None | true | 21 | 21 | Aubrey & Maturin | [(121, 77430, Master and Commander (Aubrey & M... |
| 8 | Steven Brust's fantasy series follows Vlad Tal... | 40334 | None | true | 15 | 27 | Vlad Taltos | [(20, 133454, Jhereg (Vlad Taltos, #1), 1, 152... |
| 9 | The original series has 7 novels. In 1976, Mah... | 40336 | None | true | 6 | 13 | Pellucidar | [(951, 215950, Tarzan At the Earth's Core (Tar... |
# Show schema of the file
sparks_df_series.printSchema()
root
 |-- description: string (nullable = true)
 |-- id: string (nullable = true)
 |-- note: string (nullable = true)
 |-- numbered: string (nullable = true)
 |-- primary_work_count: string (nullable = true)
 |-- series_works_count: string (nullable = true)
 |-- title: string (nullable = true)
 |-- works: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- books_count: string (nullable = true)
 |    |    |-- edition_id: string (nullable = true)
 |    |    |-- title: string (nullable = true)
 |    |    |-- user_position: string (nullable = true)
 |    |    |-- work_id: string (nullable = true)
# Show file summary for every column except those of type array
summary_stats = sparks_df_series.describe().toPandas()
display(summary_stats)
| summary | description | id | note | numbered | primary_work_count | series_works_count | title | |
|---|---|---|---|---|---|---|---|---|
| 0 | count | 47432 | 226659 | 5914 | 226659 | 226659 | 226659 | 226654 |
| 1 | mean | 1003.5 | 180317.59149206517 | 6.523565037354E12 | None | 5.143131311794369 | 5.628605085171999 | NaN |
| 2 | stddev | 1121.4658562197365 | 76648.1231153565 | 5.649574861327897E12 | None | 12.801531989070972 | 14.092438076818432 | NaN |
| 3 | min | " New Windmills is a name you know and trust. ... | 100000 | !! Do NOT add shorts that are not available se... | false | 0 | -14 | ". . . And Gulliver Returns" --In Search of Ut... |
| 4 | max | ~ 巻 | 99999 | 양장본 | true | 99 | 99 | ~者 |
After taking a look at the data and the research questions we have to answer, we created a function that builds a local database using the SQLite3 package with the following tables: books, authors, book_to_author, series, and worst_books_list (Figure 1). This way we process every file only once and can then execute all our analyses using SQL.

Figure 1: Entity-Relationship Diagram (ERD)
Comments:
- We split rating_dist in the books file into one column for each rating of 1-5.
- We split book_ids in the authors file into a mapped table book_to_author.
- We extracted year and month based on the publication_date in the books file.
- We stored the list The Worst Books of All Time in the table worst_books_list.

Before we solved the other research questions we cleaned the data in the following way:
- We removed books without a title.
- We removed dates that do not follow the format 'YYYY-MM-DD' in publication_date and/or original_publication_date, as well as dates that lie in the future.
- We removed books without num_pages or with a value of 0.
- We removed books without a ratings_count.
- We removed books without an average_rating.
- We removed books without a text_reviews_count.
- We removed authors without a fans_count.
- We removed authors without a name.
# Clean the data
clean_database()
# Set settings of Pandas so we see the whole column and not truncated
pd.set_option('display.max_colwidth', None)
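The rating_dist split mentioned in the comments above can be sketched as follows; the helper name is ours for illustration, the real logic lives in functions.py:

```python
def split_rating_dist(rating_dist: str) -> dict:
    """Split a Goodreads rating_dist string such as
    '5:1674064|4:664833|3:231195|2:41699|1:16215|total:2628006'
    into one integer value per rating (1-5), dropping the total."""
    parts = dict(item.split(":") for item in rating_dist.split("|"))
    return {f"rating_{star}": int(parts[star]) for star in "12345"}

# Example with the rating_dist of the first book shown above
split_rating_dist("5:1674064|4:664833|3:231195|2:41699|1:16215|total:2628006")
# -> {'rating_1': 16215, 'rating_2': 41699, 'rating_3': 231195,
#     'rating_4': 664833, 'rating_5': 1674064}
```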
IMPORTANT
Throughout all research questions we applied the following assumption: we inner join the data between the tables books, book_to_author, and authors on author_id AND book_id, because we noticed mistakes inside the mapped table book_to_author where a wrong author_id is connected to some book_id. If you look up the same book_id inside the books table you will sometimes find a different author_id (see the example below). After validating a few cases we decided to trust the data inside the books table and inner join on both ids to make sure that we extract the correct books for an author. This way we also consider only a single book when there are multiple versions (i.e., multiple book_ids behind a single work_id), which cleans up our data even more. We apply this assumption not only when we talk about the books of authors but ALSO when we retrieve data only from authors or books. Essentially we look at the subset shared by both files; this ensures all entries are valid because we know the exact connection between a book and an author and vice versa.
# Query to get first 10 rows of Stephen King (id=3389) ordered by original_publication_date in ascending order
query = f"""
select a.id,
a.title,
a.original_publication_date,
a.author_id,
a.name
from (
select a.id,
a.title,
a.original_publication_date,
a.author_id,
b.name
from books a
inner join authors b
on a.author_id = b.id
) a
inner join book_to_author b
on a.id = b.book_id
inner join authors c
on c.id = b.author_id
where b.author_id = 3389
order by date(a.original_publication_date) asc
"""
# Execute query and fetch results
cursor.execute(query)
output = cursor.fetchall()
# Save output in a pandas Dataframe
df = pd.DataFrame(output, columns=["book_id", "title", "original_publication_date", "author_id", "name"])
df.head(10)
| book_id | title | original_publication_date | author_id | name | |
|---|---|---|---|---|---|
| 0 | 10564744 | Dracula | 1897-05-26 | 6988 | Bram Stoker |
| 1 | 219513 | Shadows 4 | 1960-01-01 | 717781 | Charles L. Grant |
| 2 | 10592 | Carrie | 1974-04-05 | 3389 | Stephen King |
| 3 | 5413 | 'Salem's Lot | 1975-10-17 | 3389 | Stephen King |
| 4 | 11588 | The Shining | 1977-01-28 | 3389 | Stephen King |
| 5 | 66370 | Rage | 1977-09-06 | 5858 | Richard Bachman |
| 6 | 20506198 | Корпорация "Бросайте курить" (Night shift part 1 of 2) | 1978-01-01 | 3389 | Stephen King |
| 7 | 1852861 | The Year's Finest Fantasy | 1978-07-01 | 12532 | Terry Carr |
| 8 | 149267 | The Stand | 1978-10-03 | 3389 | Stephen King |
| 9 | 6296531 | The Jaunt. Travel | 1981-01-01 | 3389 | Stephen King |
As we can see, if we only inner joined on book_id we would get the wrong books for some authors!
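The effect of the double join condition can be reproduced with a small in-memory SQLite example; the two toy tables below only mimic the Dracula mismatch and are not the real data:

```python
import sqlite3

# In-memory toy database reproducing the mismatch described above:
# in books, Dracula belongs to Bram Stoker (6988), but book_to_author
# wrongly maps it to Stephen King (3389).
demo = sqlite3.connect(":memory:")
cur = demo.cursor()
cur.executescript("""
    CREATE TABLE books (id INTEGER, title TEXT, author_id INTEGER);
    CREATE TABLE book_to_author (book_id INTEGER, author_id INTEGER);
    INSERT INTO books VALUES (10564744, 'Dracula', 6988);
    INSERT INTO books VALUES (10592, 'Carrie', 3389);
    INSERT INTO book_to_author VALUES (10564744, 3389);
    INSERT INTO book_to_author VALUES (10592, 3389);
""")

# Joining only on book_id wrongly returns Dracula for Stephen King
cur.execute("""
    SELECT a.title
    FROM books a
    INNER JOIN book_to_author b
    ON a.id = b.book_id
    WHERE b.author_id = 3389
""")
only_book_id = sorted(row[0] for row in cur.fetchall())
print(only_book_id)  # ['Carrie', 'Dracula']

# Joining on BOTH ids filters the wrong mapping out
cur.execute("""
    SELECT a.title
    FROM books a
    INNER JOIN book_to_author b
    ON a.id = b.book_id AND a.author_id = b.author_id
    WHERE b.author_id = 3389
""")
both_ids = sorted(row[0] for row in cur.fetchall())
print(both_ids)  # ['Carrie']
```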
# Query to get books count for every author in descending order
query = f"""
select c.name,
count(a.id) as books_count
from books a
inner join book_to_author b
on a.id = b.book_id and b.author_id = a.author_id
inner join authors c
on c.id = b.author_id
group by c.name
order by books_count desc
"""
# Execute query and fetch results
cursor.execute(query)
output = cursor.fetchall()
# Save output in a pandas Dataframe
df = pd.DataFrame(output, columns=["author", "books_count"])
df.head(10)
| author | books_count | |
|---|---|---|
| 0 | Frederic P. Miller | 1411 |
| 1 | Jesse Russell | 1142 |
| 2 | Philip M. Parker | 1132 |
| 3 | Livres Groupe | 830 |
| 4 | Food and Agriculture Organization of the United Nations | 698 |
| 5 | Institute of Electrical and Electronics Engineers | 692 |
| 6 | National Research Council | 690 |
| 7 | Hal Leonard Corporation | 679 |
| 8 | Roger Priddy | 669 |
| 9 | Jacob Neusner | 633 |
To plot the results we extract only the unique books_count values and represent the number of books for each author in descending order in a bar chart.
# Extract only unique values
unique_books_count = df['books_count'].unique().tolist()
# Plot results
plt.bar(range(len(unique_books_count)), unique_books_count)
plt.ylabel('Books Count')
plt.title('Number of books for each author in descending order')
plt.xticks([])
plt.show()
In addition to that we dug a bit deeper and defined buckets with their corresponding count and percentage.
# Define bin sizes
bins = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 100, 1410]
# Create bins
df["books_count_binned"] = pd.cut(df["books_count"], bins=bins)
groups = df.groupby("books_count_binned", observed=False)
# Calculate bin count and percentage
group_sizes = groups.size()
group_percentages = ((group_sizes/group_sizes.sum())*100)
# Put results into a Pandas Dataframe
df_buckets = pd.DataFrame(group_percentages, columns=["Percentage"])
df_buckets["count"] = group_sizes
df_buckets.reset_index(inplace=True)
df_buckets.columns = ["bucket", "percentage", "count"]
df_buckets["percentage"] = df_buckets["percentage"].apply(lambda x: '{:.2f}%'.format(x))
df_buckets
| bucket | percentage | count | |
|---|---|---|---|
| 0 | (0, 1] | 40.31% | 71772 |
| 1 | (1, 2] | 18.90% | 33661 |
| 2 | (2, 3] | 10.94% | 19485 |
| 3 | (3, 4] | 6.90% | 12289 |
| 4 | (4, 5] | 4.72% | 8413 |
| 5 | (5, 6] | 3.38% | 6022 |
| 6 | (6, 7] | 2.51% | 4472 |
| 7 | (7, 8] | 1.97% | 3512 |
| 8 | (8, 9] | 1.48% | 2640 |
| 9 | (9, 10] | 1.19% | 2127 |
| 10 | (10, 20] | 4.92% | 8766 |
| 11 | (20, 30] | 1.34% | 2384 |
| 12 | (30, 40] | 0.54% | 963 |
| 13 | (40, 50] | 0.28% | 492 |
| 14 | (50, 100] | 0.43% | 760 |
| 15 | (100, 1410] | 0.17% | 309 |
Conclusion: Most authors (40.31%) wrote only one book, 92.30% wrote at most 10 books, and 98.56% wrote at most 30 books.
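The cumulative shares quoted in the conclusion can be reproduced from the bucket counts above; a minimal sketch (counts copied from the table, rounding each bucket's percentage first, as the table does, before accumulating):

```python
import pandas as pd

# Per-bucket author counts taken from the table above
counts = pd.Series(
    [71772, 33661, 19485, 12289, 8413, 6022, 4472, 3512, 2640, 2127,
     8766, 2384, 963, 492, 760, 309],
    index=["(0, 1]", "(1, 2]", "(2, 3]", "(3, 4]", "(4, 5]", "(5, 6]",
           "(6, 7]", "(7, 8]", "(8, 9]", "(9, 10]", "(10, 20]",
           "(20, 30]", "(30, 40]", "(40, 50]", "(50, 100]", "(100, 1410]"])

# Round per-bucket percentages to two decimals, then accumulate
pct = (counts / counts.sum() * 100).round(2)
cumulative = pct.cumsum().round(2)
print(cumulative["(9, 10]"])   # share of authors with at most 10 books
print(cumulative["(20, 30]"])  # share of authors with at most 30 books
```

This reproduces the 92.30% and 98.56% figures from the conclusion.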
In this case we can easily sort the text_reviews_count column in descending order and select only the first 10 entries.
# Query to get book title that has the highest number of reviews
query = f"""
select a.title,
a.text_reviews_count
from books a
inner join book_to_author b
on a.id = b.book_id and b.author_id = a.author_id
inner join authors c
on c.id = b.author_id
order by a.text_reviews_count desc
limit 10
"""
# Execute query and fetch results
cursor.execute(query)
output = cursor.fetchall()
# Save output in a pandas Dataframe
df = pd.DataFrame(output, columns=["title", "text_reviews_count"])
df
| title | text_reviews_count | |
|---|---|---|
| 0 | The Hunger Games (The Hunger Games, #1) | 172809 |
| 1 | The Fault in Our Stars | 156027 |
| 2 | Gone Girl | 133389 |
| 3 | Harry Potter and the Sorcerer's Stone (Harry Potter, #1) | 116930 |
| 4 | The Book Thief | 113729 |
| 5 | The Girl on the Train | 110736 |
| 6 | Divergent (Divergent, #1) | 108587 |
| 7 | Twilight (Twilight, #1) | 105057 |
| 8 | Catching Fire (The Hunger Games, #2) | 97295 |
| 9 | To Kill a Mockingbird | 93809 |
Answer: The book with the highest number of reviews is The Hunger Games (The Hunger Games, #1) with 172,809 reviews.
We added the following assumption: We only consider books with more than 1000 reviews for a better representation of our results. Using this assumption we sorted the average_rating in the books table in descending order and extracted only the first 10 entries to get the top ten books.
# Query to get the top ten books concerning the average score
query = f"""
select a.title,
a.average_rating
from books a
inner join book_to_author b
on a.id = b.book_id and b.author_id = a.author_id
inner join authors c
on c.id = b.author_id
where a.ratings_count > 1000
order by a.average_rating desc
limit 10
"""
# Execute query and fetch results
cursor.execute(query)
output = cursor.fetchall()
# Save output in a pandas Dataframe
df = pd.DataFrame(output, columns=["title", "average_rating"])
df
| title | average_rating | |
|---|---|---|
| 0 | The Complete Calvin and Hobbes | 4.82 |
| 1 | Harry Potter Boxed Set, Books 1-5 (Harry Potter, #1-5) | 4.79 |
| 2 | ESV Study Bible | 4.78 |
| 3 | Ignatius Catholic Study Bible: New Testament | 4.78 |
| 4 | It's a Magical World (Calvin and Hobbes #11) | 4.76 |
| 5 | There's Treasure Everywhere (Calvin and Hobbes #10) | 4.75 |
| 6 | The Jesus Storybook Bible: Every Story Whispers His Name | 4.75 |
| 7 | Fullmetal Alchemist Complete Box Set | 4.75 |
| 8 | Words of Radiance (The Stormlight Archive, #2) | 4.75 |
| 9 | Words of Radiance, Part 1 (The Stormlight Archive #2, Part 1 of 2) | 4.75 |
By changing the sorting in ascending order we get the worst 10 books.
# Query to get the worst ten books concerning the average score
query = f"""
select a.title,
a.average_rating
from books a
inner join book_to_author b
on a.id = b.book_id and b.author_id = a.author_id
inner join authors c
on c.id = b.author_id
where a.ratings_count > 1000
order by a.average_rating asc
limit 10
"""
# Execute query and fetch results
cursor.execute(query)
output = cursor.fetchall()
# Save output in a pandas Dataframe
df = pd.DataFrame(output, columns=["title", "average_rating"])
df
| title | average_rating | |
|---|---|---|
| 0 | Dianetics: The Modern Science of Mental Health | 2.39 |
| 1 | Citizen Girl | 2.40 |
| 2 | Cleaving: A Story of Marriage, Meat, and Obsession | 2.44 |
| 3 | Alison Wonderland | 2.45 |
| 4 | Pemberley: Or Pride and Prejudice Continued | 2.47 |
| 5 | Cum Laude | 2.52 |
| 6 | Actors Anonymous | 2.60 |
| 7 | Tris & Izzie | 2.61 |
| 8 | Stories from Candyland: Confections from One of Hollywood's Most Famous Wives and Mothers | 2.62 |
| 9 | Laura Rider's Masterpiece | 2.64 |
First we extracted a list of all unique language values with their corresponding counts and selected the top 7. We then mapped them as follows: the five values 'en', 'en-CA', 'en-GB', 'en-US', and 'eng' are grouped as English, an empty field is mapped to NA, and every language outside the top 7 to Others.
Using this logic we create a new group variable language_groups with a corresponding language_count, sort by language_count in descending order, and plot the results in a bar plot.
# Query to get the list with different languages and their count
query = f"""
select case when a.language = 'ara' then 'Arabic'
when a.language in ('en', 'en-CA', 'en-GB', 'en-US', 'eng') then 'English'
when a.language = 'fre' then 'French'
when a.language = 'ger' then 'German'
when a.language = 'ita' then 'Italian'
when a.language = 'jpn' then 'Japanese'
when a.language = 'spa' then 'Spanish'
when a.language = '' then 'NA'
else 'Others' end as language_groups,
count(*) as language_count
from books a
inner join book_to_author b
on a.id = b.book_id and b.author_id = a.author_id
inner join authors c
on c.id = b.author_id
group by language_groups
order by language_count desc
"""
# Execute query and fetch results
cursor.execute(query)
output = cursor.fetchall()
# Save output in a pandas Dataframe
df = pd.DataFrame(output, columns=["language", "language_count"])
# Plot result
plt.figure(figsize=(16, 6))
bars = plt.barh(df["language"], df["language_count"], color="skyblue")
plt.xlabel("Count")
plt.ylabel("Language")
plt.title("Language Distribution in the Book's Dataset")
plt.gca().xaxis.set_major_formatter(ticker.FuncFormatter(lambda x, _: '{:.0f} Thousand'.format(x/1000)))
# Annotate each bar with its exact count
for bar, count in zip(bars, df["language_count"]):
    plt.text(bar.get_width(),
             bar.get_y() + bar.get_height() / 2,
             '{:,.0f}'.format(count),
             ha="left",
             va="center",
             fontsize=10)
plt.show()
Conclusion: Most of the books, 474,974, have a missing value in the language field. Among the books with a valid entry, English has the highest count with 252,104.
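The same CASE mapping can be sketched in pandas; a toy example with made-up language codes (the real data sits in the language column of the books table):

```python
import pandas as pd

# Toy frame standing in for the books table's language column
df = pd.DataFrame({"language": ["eng", "en-US", "", "fre", "spa", "swe", "ger", ""]})

# Same mapping as the SQL CASE expression
lang_map = {"ara": "Arabic", "fre": "French", "ger": "German",
            "ita": "Italian", "jpn": "Japanese", "spa": "Spanish"}
english = {"en", "en-CA", "en-GB", "en-US", "eng"}

def group_language(code: str) -> str:
    if code in english:
        return "English"
    if code == "":
        return "NA"
    return lang_map.get(code, "Others")

df["language_groups"] = df["language"].map(group_language)
print(df["language_groups"].value_counts())
```

The `value_counts()` call gives the same language_count aggregation that the SQL `group by` produces.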
For this we filter all books with num_pages greater than 250 and count the total entries.
# Query to get books count with more than 250 pages
query = f"""
select count(a.id) as books_count
from books a
inner join book_to_author b
on a.id = b.book_id and b.author_id = a.author_id
inner join authors c
on c.id = b.author_id
where a.num_pages > 250
"""
# Execute query and fetch results
cursor.execute(query)
output = cursor.fetchall()
# Save output in a pandas Dataframe
df = pd.DataFrame(output, columns=["books_count"])
print("Books with more than 250 pages:", df.iloc[0, 0])
Books with more than 250 pages: 307924
Answer: 307,924 books have more than 250 pages.
To get the 50 most prolific authors we group by author name and fans_count, create a new group variable books_count, sort it in descending order, and select the first 50 entries.
# Query to get the fans count for the 50 most prolific authors
query = f"""
select c.name,
c.fans_count,
count(*) as books_count
from books a
inner join book_to_author b
on a.id = b.book_id and b.author_id = a.author_id
inner join authors c
on c.id = b.author_id
group by c.name, c.fans_count
order by books_count desc
limit 50
"""
# Execute query and fetch results
cursor.execute(query)
output = cursor.fetchall()
# Save output in a pandas Dataframe
df = pd.DataFrame(output, columns=["author", "fans_count", "books_count"])
# Plot results
df["fans_count"].plot(kind="kde",
title="Density Distribution of Fans Count for 50 Most Prolific Authors",
xlabel="Fans Count")
df.plot.scatter(x="books_count", y="fans_count")
plt.xlabel("Books Count")
plt.ylabel("Fans Count")
plt.title("Scatter Plot of Fans Count for 50 Most Prolific Authors")
plt.show()
Analysis: We don't see any linear relationship between fans count and books count; an author who wrote more books does not proportionally have more fans.
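To back up the visual impression, one could also compute a correlation coefficient; a sketch on synthetic stand-in data (the real values come from the query result df):

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic stand-ins for the query result: prolific authors whose fan
# counts are unrelated to how many books they published
rng = np.random.default_rng(42)
books_count = rng.integers(600, 1500, size=50)
fans_count = rng.integers(0, 50_000, size=50)

r, p_value = pearsonr(books_count, fans_count)
print(f"Pearson r = {r:.3f}, p = {p_value:.3f}")
# A small |r| with a large p-value is consistent with "no linear relationship"
```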
Inside our function, which takes a given year, we use that year to filter the publication_year column in books. We then group by the month column in books and sort by the newly created group variable books_count, with pages_count as a tie breaker. Our query returns a table with 12 entries, one for every month of the given year. Selecting the first entry answers the question of the most prolific month of the year. For the other output variables we sum the books_count column to get the total number of books published that year and sum the pages_count column to get the total number of pages written that year; the title of the longest book written that year is the same in every row, so we simply read it from the fourth column of the first row.
def historical_look(year: str) -> Tuple[int, int, str, str]:
    """
    Function that takes a given year and calculates the number of books published, the total number of pages
    written, the most prolific month, and the longest book written that year.
    Args:
        year (str): Year to get the historical look from.
    Returns:
        books_count (int): Number of books published that year.
        pages_count (int): Total number of pages written that year.
        most_prolific_month (str): Most prolific month of that year.
        longest_book (str): Longest book written that year.
    """
    # Query to get books and pages count for every month sorted by books count and pages count as a tie breaker
    query = f"""
    select count(a.id) as books_count,
           sum(a.num_pages) as pages_count,
           case when a.publication_month = '01' then 'January'
                when a.publication_month = '02' then 'February'
                when a.publication_month = '03' then 'March'
                when a.publication_month = '04' then 'April'
                when a.publication_month = '05' then 'May'
                when a.publication_month = '06' then 'June'
                when a.publication_month = '07' then 'July'
                when a.publication_month = '08' then 'August'
                when a.publication_month = '09' then 'September'
                when a.publication_month = '10' then 'October'
                when a.publication_month = '11' then 'November'
                when a.publication_month = '12' then 'December'
           end as month,
           (
               select title
               from books a
               inner join book_to_author b
               on a.id = b.book_id and b.author_id = a.author_id
               inner join authors c
               on c.id = b.author_id
               where a.publication_year = '{year}'
               order by num_pages desc
               limit 1
           ) as title
    from books a
    inner join book_to_author b
    on a.id = b.book_id and b.author_id = a.author_id
    inner join authors c
    on c.id = b.author_id
    where a.publication_year = '{year}'
    group by month
    order by books_count desc, pages_count desc
    """
    # Execute query and fetch results
    cursor.execute(query)
    output = cursor.fetchall()
    # Save output in a pandas Dataframe
    df = pd.DataFrame(output, columns=["books_count", "pages_count", "month", "title"])
    books_count = df["books_count"].sum()
    pages_count = df["pages_count"].sum()
    most_prolific_month = df.iloc[0, 2]
    longest_book = df.iloc[0, 3]
    return books_count, pages_count, most_prolific_month, longest_book
This would be the return for the given year of 1997:
# Given year
year = 1997
books_count, pages_count, most_prolific_month, longest_book = historical_look(year)
print(f"Number of books published in {year}:", books_count, "\n")
print(f"Total number of pages written in {year}:", pages_count, "\n")
print(f"Most prolific month of {year}:", most_prolific_month, "\n")
print(f"Longest book written in {year}:", longest_book)
Number of books published in 1997: 10900
Total number of pages written in 1997: 2715391
Most prolific month of 1997: January
Longest book written in 1997: Gesammelte Schriften in 20 Bänden
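As an aside, the function above builds its SQL by interpolating `year` into an f-string; sqlite3 also supports `?` placeholders, which avoids quoting problems. A self-contained sketch on an in-memory toy table (the table contents here are made up):

```python
import sqlite3

# In-memory toy table mimicking the relevant books columns
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table books (title text, publication_year text, num_pages int)")
cur.executemany("insert into books values (?, ?, ?)",
                [("A", "1997", 300), ("B", "1997", 120), ("C", "1998", 500)])

# Placeholder binding instead of f-string interpolation
year = "1997"
cur.execute("select count(*), sum(num_pages) from books where publication_year = ?",
            (year,))
print(cur.fetchone())  # (2, 420)
```

The binding is done by the driver, so the value never has to be quoted or escaped by hand.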
First we extract all unique years in which books were published by grouping by publication_year and sorting in ascending order. We then loop through every year in this list and append the output of our function to a Pandas Dataframe. Finally we set the year as the index of the Dataframe.
# Query to get all unique years books were published
query = f"""
select a.publication_year
from books a
inner join book_to_author b
on a.id = b.book_id and b.author_id = a.author_id
inner join authors c
on c.id = b.author_id
group by a.publication_year
order by a.publication_year asc
"""
# Execute query and fetch results
cursor.execute(query)
output = cursor.fetchall()
# Save output in a pandas Dataframe
df = pd.DataFrame(output, columns=["year"])
# Save years in a list
years = df["year"].tolist()
# Create an empty Dataframe
data = {
"year": [],
"books_count": [],
"pages_count": [],
"most_prolific_month": [],
"longest_book": []
}
df = pd.DataFrame(data)
# Iterate through all years
for year in years:
    # Get results for current year
    books_count, pages_count, most_prolific_month, longest_book = historical_look(year)
    # Add results to a dictionary
    data = {
        "year": [year],
        "books_count": [books_count],
        "pages_count": [pages_count],
        "most_prolific_month": [most_prolific_month],
        "longest_book": [longest_book]
    }
    # Add entry to dataframe
    df = pd.concat([df, pd.DataFrame(data)])
# Change format from columns
df["year"] = df["year"].astype(int)
df["books_count"] = df["books_count"].astype(int)
df["pages_count"] = df["pages_count"].astype(int)
# Set year as primary key
df.set_index("year", inplace=True)
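A note on the loop above: calling pd.concat once per iteration copies the whole frame each time, which is quadratic in the number of years. Collecting rows in a list and building the frame once is the usual idiom; a sketch with a hypothetical stand-in for historical_look:

```python
import pandas as pd

# Hypothetical stand-in for historical_look(year) so the sketch is runnable
def historical_look_stub(year):
    n = year % 7 + 1
    return n, n * 200, "January", f"Longest of {year}"

rows = []
for year in [1995, 1996, 1997]:
    books_count, pages_count, month, longest = historical_look_stub(year)
    rows.append({"year": year, "books_count": books_count,
                 "pages_count": pages_count,
                 "most_prolific_month": month, "longest_book": longest})

# Build the frame once, then index by year
df = pd.DataFrame(rows).set_index("year")
print(df)
```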
These are our first 10 entries:
# Print the first 10 rows
display(df.head(10))
| books_count | pages_count | most_prolific_month | longest_book | |
|---|---|---|---|---|
| year | ||||
| 1011 | 1 | 150 | March | How to Find Yourself (or a reasonable facsimile) - Illustrated Version |
| 1012 | 5 | 705 | May | Tara's Tears |
| 1013 | 2 | 276 | October | Satan Reborn |
| 1025 | 1 | 118 | February | Strings of Life |
| 1211 | 1 | 324 | November | The Magic of the Camino |
| 1212 | 1 | 108 | December | The Lamb (The Birthright #3) |
| 1213 | 2 | 472 | December | Embroidering Andrea, a Quilted Mystery novel (Quilted Mystery, #3) |
| 1352 | 1 | 128 | May | كاخ ابداع |
| 1358 | 1 | 160 | May | کتاب جمعه |
| 1370 | 2 | 620 | January | Miroir Historial [Speculum historiale] vol. II. ;Livres VII-XI |
These are our last 10 entries:
# Print the last 10 rows
display(df.tail(10))
| books_count | pages_count | most_prolific_month | longest_book | |
|---|---|---|---|---|
| year | ||||
| 2014 | 86953 | 29652254 | May | Fermata Qt8 - Storie Ripensate in Metropolitana |
| 2015 | 31722 | 7832802 | January | A Most Unlikely Countess (To Love a Wildcat, #2) |
| 2016 | 2587 | 744412 | January | Homestuck |
| 2017 | 754 | 211486 | February | War |
| 2018 | 521 | 171527 | February | Theory of Knowledge (Tok) (Student Book and Etext) (Pearson Baccalaureate) |
| 2019 | 327 | 85417 | January | Breast Ultrasound |
| 2020 | 234 | 64675 | March | International Relations of the Cold War |
| 2021 | 53 | 15878 | January | Chain of Iron (The Last Hours, #2) |
| 2022 | 3 | 868 | May | Weltgesellschaft: Raum - Kommunikation - Herrschaft |
| 2023 | 1 | 463 | September | Apocalypse |
import pandas as pd
def historical_analysis(year):
    # Filter the DataFrame for books published in the given year
    books_in_year = books_df[books_df['publication_date'].str.startswith(year)].copy()
    # Number of books published in the given year
    num_books = len(books_in_year)
    # Total number of pages written in the given year
    total_pages = books_in_year['num_pages'].sum()
    # Group books by month of publication and find the most prolific month
    books_in_year['publication_date'] = pd.to_datetime(books_in_year['publication_date'])
    books_by_month = books_in_year.groupby(books_in_year['publication_date'].dt.month)
    most_prolific_month = books_by_month.size().idxmax()
    # Find the longest book written in the given year
    longest_book = books_in_year.loc[books_in_year['num_pages'].idxmax()]
    # Return the collected information as a dictionary
    return {
        'num_books': num_books,
        'total_pages': total_pages,
        'most_prolific_month': most_prolific_month,
        'longest_book': longest_book['title'],
    }
# Example usage: Get information for the year 2020
year_2020_info = historical_analysis('2020')
print(year_2020_info)
import pandas as pd
# Load the "books" dataset (replace 'your_books.csv' with the actual file path)
books_df = pd.read_csv('your_books.csv')
# Define the range of years you want to analyze
start_year = 2000
end_year = 2022 # Adjust this to your dataset's range of years
# Create a list of dictionaries to store the information for each year
yearly_info = []
# Iterate through the years and collect information
for year in range(start_year, end_year + 1):
    year_info = historical_analysis(str(year))
    yearly_info.append(year_info)
# Create a DataFrame from the list of dictionaries
df = pd.DataFrame(yearly_info)
# Show the head and tail of the DataFrame for the first and last ten years
head_and_tail = pd.concat([df.head(10), df.tail(10)])
# Display the resulting DataFrame
print(head_and_tail)
We are going to check whether the ChatGPT function is correct. To prove this, we imagine we have the following table with the following values (we only took the columns that are necessary for the exercise):
| TITLE | PUBLICATION_DATE | NUM_PAGES |
|---|---|---|
| Book1 | 2023-01-15 | 300 |
| Book2 | 2022-09-10 | 250 |
| Book3 | 2021-03-20 | 525 |
| Book4 | 2020-03-19 | 330 |
| Book5 | 2022-08-24 | 275 |
| Book6 | 2022-02-11 | 400 |
| Book7 | 2022-02-18 | 220 |
We are going to suppose we want to work with year = "2022".
The first line of code is:
books_in_year = books_df[books_df['publication_date'].str.startswith("2022")]
In this case ChatGPT assumed that all information was in books_df, a Pandas DataFrame. We had not been working with a Pandas DataFrame, but we assume the initial content of the table is the same.
The function str.startswith takes a string and returns True if the string begins with the indicated prefix, in this case "2022". So we keep only the rows published in 2022.
| TITLE | PUBLICATION_DATE | NUM_PAGES |
|---|---|---|
| Book2 | 2022-09-10 | 250 |
| Book5 | 2022-08-24 | 275 |
| Book6 | 2022-02-11 | 400 |
| Book7 | 2022-02-18 | 220 |
So, knowing the size of this DataFrame, it is easy to get the number of books of this year.
num_books = len(books_in_year)
The following line of code is:
total_pages = books_in_year['num_pages'].sum()
This method takes all the values of a column and sums the entries. In this case, 250 + 275 + 400 + 220 = 1145, so we have the number of pages written this year.
In order to find the most prolific month, ChatGPT wrote:
books_in_year['publication_date'] = pd.to_datetime(books_in_year['publication_date'])
books_by_month = books_in_year.groupby(books_in_year['publication_date'].dt.month)
most_prolific_month = books_by_month.size().idxmax()
The first line takes the publication_date column and converts its type to datetime. Doing this, it is easy to get the month.
The second line performs a group by that puts the rows of each different month into their own group. In our case, we would have:
| TITLE | PUBLICATION_DATE | NUM_PAGES |
|---|---|---|
| Book2 | 2022-09-10 | 250 |
| TITLE | PUBLICATION_DATE | NUM_PAGES |
|---|---|---|
| Book5 | 2022-08-24 | 275 |
| TITLE | PUBLICATION_DATE | NUM_PAGES |
|---|---|---|
| Book6 | 2022-02-11 | 400 |
| Book7 | 2022-02-18 | 220 |
And the third line returns the month whose group has the most rows, in our case the month with the highest number of publications. So February would be the most prolific month.
To conclude, the longest book is found using this code:
longest_book = books_in_year.loc[books_in_year['num_pages'].idxmax()]
Here it takes all the values from the num_pages column and returns the row with the highest number of pages. In our case:
| TITLE | PUBLICATION_DATE | NUM_PAGES |
|---|---|---|
| Book6 | 2022-02-11 | 400 |
And it is all returned using a Dictionary.
To build a DataFrame for every year, ChatGPT simply uses a for loop that runs over all years and creates a DataFrame from the dictionaries returned by the first function. To return the first and last ten years it uses df.head(10) and df.tail(10) and concatenates these two DataFrames.
So it is correct.
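The walkthrough above can be checked end-to-end by running the core steps of the ChatGPT snippet on the toy table:

```python
import pandas as pd

# The toy table from the walkthrough
books_df = pd.DataFrame({
    "title": ["Book1", "Book2", "Book3", "Book4", "Book5", "Book6", "Book7"],
    "publication_date": ["2023-01-15", "2022-09-10", "2021-03-20", "2020-03-19",
                         "2022-08-24", "2022-02-11", "2022-02-18"],
    "num_pages": [300, 250, 525, 330, 275, 400, 220],
})

# Filter to 2022 and count books and pages, as in the snippet
books_in_year = books_df[books_df["publication_date"].str.startswith("2022")]
num_books = len(books_in_year)
total_pages = books_in_year["num_pages"].sum()
print(num_books, total_pages)  # 4 1145

# Most prolific month and longest book
months = pd.to_datetime(books_in_year["publication_date"]).dt.month
print(books_in_year.groupby(months).size().idxmax())  # 2 (February)
print(books_in_year.loc[books_in_year["num_pages"].idxmax(), "title"])  # Book6
```

The results match the worked example: 4 books, 1145 pages, February as the most prolific month, and Book6 as the longest book.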
We are going to extract all names from the authors table and create a regex that removes the defined special characters.
# Query to get all authors
query = """
select distinct c.id,
c.name
from books a
inner join book_to_author b
on a.id = b.book_id and b.author_id = a.author_id
inner join authors c
on c.id = b.author_id
"""
# Execute query and fetch results
cursor.execute(query)
output = cursor.fetchall()
# Save output in a pandas Dataframe
df = pd.DataFrame(output, columns=["id", "name"])
# Regex pattern
pattern = '[,;.:-_#+*<>!"§$%&/()=?´`“¶¢\[\]\\{}≠¿¬”#£fi\˜·¯˙˚ -\x27]'
# Clean the names with the defined regex
df["name_cleaned"] = df["name"].str.lower().replace(pattern, '', regex=True)
# Group by the new column and count
grouped_df = df.groupby("name_cleaned").size().reset_index(name="count")
# Sort the count column in ascending order and keep only counts greater than 1
sorted_df = grouped_df.sort_values(by="count", ascending=True)
sorted_df = sorted_df[sorted_df["count"] > 1]
# Inner join original dataframe with the new dataframe to extract eponymous
result = pd.merge(sorted_df, df, on="name_cleaned", how="inner")
# Count all eponymous
eponymous = result["name_cleaned"].nunique()
print("Count of all eponymous:", eponymous)
print(tabulate(result.head(20), headers="keys", tablefmt="psql"))
Count of all eponymous: 5937
+----+--------------------+---------+----------+---------------------+
|    | name_cleaned       |   count |       id | name                |
|----+--------------------+---------+----------+---------------------|
|  0 | josephhoffman      |       2 | 13747629 | Joseph Hoffman      |
|  1 | josephhoffman      |       2 | 13990676 | Joseph Hoffman      |
|  2 | howardjones        |       2 |    32661 | Howard Jones        |
|  3 | howardjones        |       2 |  5831171 | Howard Jones        |
|  4 | marksallen         |       2 |  6001035 | Mark S. Allen       |
|  5 | marksallen         |       2 | 15194644 | Mark S. Allen       |
|  6 | johnmduncan        |       2 |  2618030 | John M. Duncan      |
|  7 | johnmduncan        |       2 | 16247576 | John M. Duncan      |
|  8 | petertemple        |       2 |   174528 | Peter Temple        |
|  9 | petertemple        |       2 |  4087121 | Peter Temple        |
| 10 | peterconrad        |       2 |    62824 | Peter Conrad        |
| 11 | peterconrad        |       2 |  3125305 | Peter Conrad        |
| 12 | philcollins        |       2 |   136961 | Phil Collins        |
| 13 | philcollins        |       2 | 14907462 | Phil Collins        |
| 14 | marymcauliffe      |       2 |   372558 | Mary McAuliffe      |
| 15 | marymcauliffe      |       2 | 19126716 | Mary McAuliffe      |
| 16 | phyllishoffman     |       2 |   165492 | Phyllis Hoffman     |
| 17 | phyllishoffman     |       2 |  7417118 | Phyllis Hoffman     |
| 18 | catherinehernandez |       2 |   199206 | Catherine Hernandez |
| 19 | catherinehernandez |       2 | 16873595 | Catherine Hernandez |
+----+--------------------+---------+----------+---------------------+
Conclusion: It is not true that the authors dataset has no eponymous. As we showed in the previous step the dataset has 5937 different eponymous in total.
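The same duplicate-name check can be sketched purely in pandas, with a simpler normalization than the regex above (synthetic example; the real data comes from the authors table):

```python
import pandas as pd

# Synthetic author rows: two distinct ids per shared name, plus a unique one
df = pd.DataFrame({
    "id": [13747629, 13990676, 32661, 5831171, 42],
    "name": ["Joseph Hoffman", "Joseph  Hoffman", "Howard Jones",
             "Howard Jones", "Unique Author"],
})

# Normalize: lowercase and strip everything that is not a letter or digit
df["name_cleaned"] = df["name"].str.lower().str.replace(r"[^a-z0-9]", "", regex=True)

# Cleaned names shared by more than one author id
shared = df.groupby("name_cleaned")["id"].nunique()
eponymous = (shared > 1).sum()
print(eponymous)  # 2 shared names: josephhoffman and howardjones
```

Counting distinct ids per cleaned name (rather than rows) guards against the same author id appearing twice.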
Our function takes a list of author ids and joins them into a string that we can place inside the where clause of our query, so that only the titles of the given authors are considered.
def authors_dict(author_list: List) -> Dict:
    """
    Function that takes a list of authors and returns a dictionary with all the titles of the published books for
    each author.
    Args:
        author_list (List): List with author ids.
    Returns:
        authors_dict (Dict): Dictionary with titles for each author.
    """
    # Create a joined comma-separated string of the author ids
    authors = ", ".join(map(str, author_list))
    # Query to get all published books for a given list of authors
    query = f"""
    select a.author_id,
           a.title
    from books a
    inner join book_to_author b
    on a.id = b.book_id and b.author_id = a.author_id
    inner join authors c
    on c.id = b.author_id
    where b.author_id in ({authors})
    """
    # Execute query and fetch results
    cursor.execute(query)
    output = cursor.fetchall()
    # Save output in a pandas Dataframe
    df = pd.DataFrame(output, columns=["author_id", "title"])
    # Transform the dataframe into a dictionary
    authors_dict = df.groupby("author_id")["title"].apply(list).to_dict()
    return authors_dict
Example:
# Given list of authors
authors_list = [13990676, 15194644, 19126716]
# Run function
authors_dict = authors_dict(authors_list)
authors_dict
{13990676: ['Quick Reference Glossary of Eyecare Terminology',
'Pocket Glossary of Ophthalmologic Terminology',
'A History of Modern IOLs',
'Ophthamalogic Glossary'],
15194644: ["Evaluation of the Mountain Wave Forecast Model's Stratospheric Turbulence Simulations"],
19126716: ['Palgrave Advances in Irish History',
'Tribades, Tommies and Transgressives; History of Sexualities: Volume I']}
We added the following assumption: we only consider authors with more than 1000 ratings for a better representation of our results. Using this assumption, a sub-query selects the top 20 authors sorted by average_rating in descending order. Based on that we order their books by title length in descending order and select the first 5 entries for a better visualization. To get the longest published title among all books we simply order all books by title length in descending order, again selecting the first 5 entries.
# Query to get the longest published title of the top 20 authors regarding their average rating
query1 = f"""
select b.title,
b.author_id,
length(b.title) as title_length
from book_to_author a
inner join books b
on a.book_id = b.id and a.author_id = b.author_id
where b.author_id in (
select id
from authors
where ratings_count > 1000
order by average_rating desc
limit 20
)
order by length(b.title) desc
limit 5
"""
# Execute query and fetch results
cursor.execute(query1)
output = cursor.fetchall()
# Save output in a pandas Dataframe
df1 = pd.DataFrame(output, columns=["title", "author_id", "title_length"])
# Query to get longest published title of all books
query2 = f"""
select a.title,
a.author_id,
length(a.title) as title_length
from books a
inner join book_to_author b
on a.id = b.book_id and b.author_id = a.author_id
inner join authors c
on c.id = b.author_id
order by length(a.title) desc
limit 5
"""
# Execute query and fetch results
cursor.execute(query2)
output = cursor.fetchall()
# Save output in a pandas Dataframe
df2 = pd.DataFrame(output, columns=["title", "author_id", "title_length"])
The five longest book titles among the books of the top 20 authors regarding their average rating:
df1
| title | author_id | title_length | |
|---|---|---|---|
| 0 | In Conversation with God: Feasts, January-June v. 6: Meditations for Each Day of the Year: Feasts, Jan-June Vol 6 | 5519124 | 113 |
| 1 | 24-Carat Commercials for Kids: Everything Kids Need to Know to Break Into Commercials | 175016 | 85 |
| 2 | Optimized Nutrition:: A strategic guide for the martial artist (Volume 1) | 7263689 | 73 |
| 3 | Optimized Nutrition: Vol. 2: Customized workouts & Meal plans (Volume 2) | 7263689 | 72 |
| 4 | Just a Book of Poems for the Coffee Table- Vol. 2 (Volume 2) | 7263689 | 60 |
The five longest book titles overall:
df2
| title | author_id | title_length | |
|---|---|---|---|
| 0 | A Commentary and Review of Montesquieu's Spirit of Laws, Prepared for Press from the Original Manuscript in the Hands of the Publisher (1811): To Which Are Annexed, Observations on the Thirty-First Book, by the Late M. Condorcet. and Two Letters of Hel... | 7545655 | 255 |
| 1 | Confessions of Shameless Self-Promoters: Great Marketing Gurus Share Their Innovative, Proven, and Low-Cost Marketing Strategies to Maximize Your Success!: Great Marketing Gurus Share Their Innovative, Proven, and Low-Cost Marketing Strategies to Maximize | 21095542 | 255 |
| 2 | A General Introduction to Domesday Book: Accompanied by Indexes of the Tenants-In-Chief, and Under-Tenants, at the Time of the Survey: As Well as of the Holders of Lands Mentioned in Domesday Anterior to the Formation of That Record: With an Abstract o... | 120507 | 255 |
| 3 | Mare Clausum. of the Dominion, Or, Ownership of the Sea. Two Books: In the First, Is Shew'd That the Sea, by the Law of Nature, or Nations, Is Not Common to All Men But Capable of Private Dominion or Proprietie as Well as the Land in the Second, Is Pro... | 220015 | 255 |
| 4 | The Partition Omnibus: Comprising Prelude to Partition: The Indian Muslims and the Imperial System of Control 1920 - 1932. the Origins of the Partition of India 1936 - 1947 Divide and Quit: An Eyewitness Account of the Partition of India with Contribut... | 15772721 | 255 |
Answer: The longest book title among the books of the top 20 authors regarding their average rating is In Conversation with God: Feasts, January-June v. 6: Meditations for Each Day of the Year: Feasts, Jan-June Vol 6. One of the longest book titles overall is A Commentary and Review of Montesquieu's Spirit of Laws, Prepared for Press from the Original Manuscript in the Hands of the Publisher (1811): To Which Are Annexed, Observations on the Thirty-First Book, by the Late M. Condorcet. and Two Letters of Hel...; all five overall titles have a length of exactly 255 characters, which suggests the title field is truncated at 255.
For this we can simply sort by title length in ascending order and select the first 10 entries for a better visualization.
# Query to get shortest published title of all books
query = f"""
select a.title,
a.author_id
from books a
inner join book_to_author b
on a.id = b.book_id and b.author_id = a.author_id
inner join authors c
on c.id = b.author_id
order by length(title) asc
limit 10
"""
# Execute query and fetch results
cursor.execute(query)
output = cursor.fetchall()
# Save output in a pandas Dataframe
df = pd.DataFrame(output, columns=["title", "author_id"])
df
| title | author_id | |
|---|---|---|
| 0 | a | 1203 |
| 1 | K | 71071 |
| 2 | M | 255080 |
| 3 | M | 375543 |
| 4 | E | 160026 |
| 5 | I | 784864 |
| 6 | A | 28367 |
| 7 | X | 4519325 |
| 8 | R | 2972853 |
| 9 | E | 3077759 |
Answer: The query returns titles with a length of 1. Assuming those are mistakes, we add a filter that keeps only titles with a length greater than 1.
# Query to get shortest published title of all books
query = f"""
select a.title,
a.author_id
from books a
inner join book_to_author b
on a.id = b.book_id and b.author_id = a.author_id
inner join authors c
on c.id = b.author_id
where length(title) > 1
order by length(title) asc
limit 10
"""
# Execute query and fetch results
cursor.execute(query)
output = cursor.fetchall()
# Save output in a pandas Dataframe
df = pd.DataFrame(output, columns=["title", "author_id"])
df
| title | author_id | |
|---|---|---|
| 0 | V. | 235 |
| 1 | JR | 15991 |
| 2 | Oh | 86787 |
| 3 | Rx | 106823 |
| 4 | Do | 110257 |
| 5 | Be | 110257 |
| 6 | Os | 295910 |
| 7 | 27 | 109299 |
| 8 | Us | 7755518 |
| 9 | Up | 62910 |
Alternative answer: One of the shortest overall book title in the dataset is V..
For this we first select the top 10 most influential authors by fan count in a sub-query, sorting by fans_count in descending order. Based on this result we select all books from these authors, group by the author id to create a new group variable books_count, and sort first by fans_count in descending order, then by books_count in descending order as a tie breaker.
# Query to get a list with the top 10 authors concerning the number of fans and using the number of books as a tie
# breaker
query = """
select a.id,
a.name,
a.fans_count,
count(c.id) as books_count
from (
select id,
name,
fans_count
from authors
order by fans_count desc
limit 10
) a
inner join book_to_author b
on a.id = b.author_id
inner join books c
on b.book_id = c.id and c.author_id = a.id
group by a.id
order by fans_count desc, books_count desc
limit 10
"""
# Execute query and fetch results
cursor.execute(query)
output = cursor.fetchall()
# Save output in a pandas Dataframe
df = pd.DataFrame(output, columns=["author_id", "author_name", "fans_count", "books_count"])
df
| author_id | author_name | fans_count | books_count | |
|---|---|---|---|---|
| 0 | 3389 | Stephen King | 766035 | 105 |
| 1 | 4039811 | Veronica Roth | 455358 | 14 |
| 2 | 23470 | Bill Gates | 446876 | 2 |
| 3 | 5430144 | Colleen Hoover | 356633 | 10 |
| 4 | 15872 | Rick Riordan | 339346 | 35 |
| 5 | 3780 | James Patterson | 317040 | 160 |
| 6 | 1406384 | John Green | 282537 | 8 |
| 7 | 12855 | Neil deGrasse Tyson | 277850 | 5 |
| 8 | 150038 | Cassandra Clare | 238560 | 45 |
| 9 | 194416 | Mindy Kaling | 234647 | 2 |
Answer: Stephen King is the most influential author, with 766,035 fans and 105 published books.
For the following questions we use the author_ids acquired in the previous result, sort by title length in descending order, and select the first 10 entries for a better representation.
# Query to get a list with the top 10 longest series titles from the top 10 authors concerning the number of fans
# and using the number of books as a tie breaker
query = """
select distinct b.title as series_title
from books a
inner join series b
on a.series_id = b.id
inner join book_to_author c
on a.id = c.book_id
and c.author_id in (3389,
4039811,
23470,
5430144,
15872,
3780,
1406384,
12855,
150038,
194416)
where a.series_id > 0
order by length(b.title) desc
limit 10
"""
# Execute query and fetch results
cursor.execute(query)
output = cursor.fetchall()
# Save output in a pandas Dataframe
df = pd.DataFrame(output, columns=["series_title"])
df
| | series_title |
|---|---|
| 0 | Percy Jackson and the Olympians: The Graphic Novels |
| 1 | Percy Jackson & Kane Chronicles Crossover |
| 2 | The Heroes of Olympus: The Graphic Novels |
| 3 | The Year's Best Dark Fantasy & Horror |
| 4 | The Mortal Instruments: Graphic Novel |
| 5 | Sherlock Holmes Mysteries (Barry Day) |
| 6 | Magnus Chase and the Gods of Asgard |
| 7 | Tales from the Shadowhunter Academy |
| 8 | The Year's Best Fantasy and Horror |
| 9 | The Best American Mystery Stories |
Answer: The longest series name among the top 10 authors by number of fans is Percy Jackson and the Olympians: The Graphic Novels.
We approached this problem the same way we analyzed the different languages. First we extracted a list of all unique values for the field format, narrowed it down to the top five, and mapped all other formats to Other and empty fields to NA. Again we use the list of author_ids acquired in the previous result. We then group by author name and our newly defined format column. Finally we present the results in a box plot.
# Query to get a list with the published formats from the top 10 authors concerning the number of fans and using
# the number of books as a tie breaker
query = """
select a.name,
case when c.format in ('Hardcover',
'Paperback',
'Kindle Edition',
'ebook',
'Audio CD') then c.format
when c.format is '' then 'NA'
else 'Other' end as format,
count(a.name) as count
from authors a
inner join book_to_author b
on a.id = b.author_id
and b.author_id in (3389,
4039811,
23470,
5430144,
15872,
3780,
1406384,
12855,
150038,
194416)
inner join books c
on b.book_id = c.id and c.author_id = a.id
group by a.name, format
"""
# Execute query and fetch results
cursor.execute(query)
output = cursor.fetchall()
# Save output in a pandas Dataframe
df = pd.DataFrame(output, columns=["author_name", "format", "count"])
# Calculate the total count for each author
df['total_count'] = df.groupby("author_name")["count"].transform("sum")
# Calculate the percentage of each format for each author
df['percentage'] = (df['count']/df['total_count'])*100
# Plot results in a Boxplot
fig = px.box(df,
x="format",
y="percentage",
title="Distribution of Formats Among the Top 10 Most Influential Authors",
category_orders={"format": ["Paperback",
"Hardcover",
"ebook",
"Kindle Edition",
"Audio CD",
"Other"]})
fig.update_xaxes(title_text="Format")
fig.update_yaxes(title_text="Percentage (%)")
fig.show("notebook")
Analysis: Most of the books were published as paperbacks, hardcovers, and ebooks.
query = f"""
select case when c.gender in ('male', 'Male') then 'Male'
when c.gender in ('female', 'woman', 'Femme') then 'Female'
end as genders_groups,
count(*) as gender_count, sum(c.average_rating), sum(c.fans_count), sum(c.ratings_count), sum(c.text_reviews_count)
from books a
inner join book_to_author b
on a.id = b.book_id and b.author_id = a.author_id
inner join authors c
on c.id = b.author_id
group by genders_groups
order by genders_groups desc
"""
# Execute query and fetch results
cursor.execute(query)
output = cursor.fetchall()
# Save output in a pandas Dataframe
columns = ["Gender", "Counts", "Average_Rating", "Fans_Count", "Ratings_Count", "Text_Reviews_Count"]
df = pd.DataFrame(output, columns = columns)
df = df[0:2]
df.plot(x = "Gender", y = "Counts", kind = 'bar')
<Axes: xlabel='Gender'>
The data set contains many different gender labels, some at a very coarse level of aggregation. We are aware of how important it is to define each gender category correctly, but we do not have enough domain knowledge to adequately group all the categories present, which is why we decided to keep only the male/female split.
Here we can see that male authors have roughly twice the representation of female authors. The literature we read highlights the need for sociology experts to help with data labeling; in this case, however, we do not believe the imbalance stems from an interpretation bias in the labeling itself.
We think it is more of a historical matter: the data set contains books going back many decades, and women's authorship has historically not been on the same footing as men's. There were women who could not publish at all, or who published under male pen names. It is therefore unsurprising that male writers far outnumber female writers, and we would expect this imbalance to shrink the closer we get to the present.
We computed the proportion of male and female authors for every fifth year since 1920, and we can see that by around 2000 women already exceeded 30%, surpassing men in some years:
for year in range(1920, 2020, 5):
query = f"""
select case when c.gender in ('male', 'Male') then 'Male'
when c.gender in ('female', 'woman', 'Femme') then 'Female'
end as genders_groups,
count(*) as gender_count, sum(c.average_rating), sum(c.fans_count), sum(c.ratings_count), sum(c.text_reviews_count), substr(publication_date, 1, 4) as YEAR
from books a
inner join book_to_author b
on a.id = b.book_id and b.author_id = a.author_id
inner join authors c
on c.id = b.author_id
where YEAR = "{year}"
group by genders_groups
order by genders_groups desc
"""
# Execute query and fetch results
cursor.execute(query)
output = cursor.fetchall()
# Save output in a pandas Dataframe
columns = ["Gender", "Counts", "Average_Rating", "Fans_Count", "Ratings_Count", "Text_Reviews_Count", "Year"]
df = pd.DataFrame(output, columns = columns)
df = df[0:2]
total = df["Counts"][0] + df["Counts"][1]
print(year)
print("Males: " + str(df["Counts"][0]*100/total)[:4] + "%")
print("Females: " + str(df["Counts"][1]*100/total)[:4] + "%")
print("---------------------------------")
| Year | Males | Females |
|---|---|---|
| 1920 | 89.1% | 10.8% |
| 1925 | 88.2% | 11.7% |
| 1930 | 92.0% | 8.0% |
| 1935 | 88.0% | 12.0% |
| 1940 | 80.0% | 20.0% |
| 1945 | 88.8% | 11.1% |
| 1950 | 71.7% | 28.2% |
| 1955 | 80.5% | 19.4% |
| 1960 | 92.8% | 7.14% |
| 1965 | 84.5% | 15.4% |
| 1970 | 87.4% | 12.5% |
| 1975 | 85.8% | 14.1% |
| 1980 | 81.5% | 18.4% |
| 1985 | 79.1% | 20.8% |
| 1990 | 72.4% | 27.5% |
| 1995 | 66.8% | 33.1% |
| 2000 | 64.9% | 35.0% |
| 2005 | 61.4% | 38.5% |
| 2010 | 61.3% | 38.6% |
| 2015 | 44.3% | 55.6% |
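For reference, the printed shares can be collected into a quick trend check. This is a small illustrative snippet, not part of the original analysis; the values below are copied from the output above:

```python
years = list(range(1920, 2020, 5))
female_pct = [10.8, 11.7, 8.0, 12.0, 20.0, 11.1, 28.2, 19.4, 7.14, 15.4,
              12.5, 14.1, 18.4, 20.8, 27.5, 33.1, 35.0, 38.5, 38.6, 55.6]

# Average female share in the first and last four five-year periods
early = sum(female_pct[:4]) / 4   # 1920-1935
late = sum(female_pct[-4:]) / 4   # 2000-2015
print(f"1920-1935: {early:.1f}%  vs  2000-2015: {late:.1f}%")
```

The early-period average sits around 11%, while the most recent four periods average above 40%, which backs up the claim that the imbalance shrinks toward the present.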
Let's now look at the averages, so that we can compare the two genders on an equal footing.
query = f"""
select case when c.gender in ('male', 'Male') then 'Male'
when c.gender in ('female', 'woman', 'Femme') then 'Female'
end as genders_groups,
count(*) as gender_count, avg(c.average_rating), avg(c.fans_count), avg(c.ratings_count), avg(c.text_reviews_count)
from books a
inner join book_to_author b
on a.id = b.book_id and b.author_id = a.author_id
inner join authors c
on c.id = b.author_id
group by genders_groups
order by genders_groups desc
"""
# Execute query and fetch results
cursor.execute(query)
output = cursor.fetchall()
# Save output in a pandas Dataframe
columns_mean = ["Gender", "Counts", "mean(Average_Rating)", "mean(Fans_Count)", "mean(Ratings_Count)", "mean(Text_Reviews_Count)"]
df_mean = pd.DataFrame(output, columns = columns_mean)
df_mean = df_mean[0:2]
df_mean
| | Gender | Counts | mean(Average_Rating) | mean(Fans_Count) | mean(Ratings_Count) | mean(Text_Reviews_Count) |
|---|---|---|---|---|---|---|
| 0 | Male | 348138 | 3.855137 | 945.194495 | 54912.428261 | 2445.378637 |
| 1 | Female | 243668 | 3.883526 | 999.069188 | 51584.929006 | 3074.769625 |
The prompt that I have used is:
Explain the concept of bias in machine learning and the challenges associated with unbalanced datasets. Provide insights into methods and techniques to mitigate bias and handle imbalanced data effectively.
The answer has been:
Bias in machine learning refers to the presence of systematic and consistent errors in the model's predictions. These errors can occur due to various reasons, and one common issue is the presence of unbalanced datasets. Unbalanced datasets are datasets where the number of examples in each class or category is significantly unequal. Here's an explanation of both concepts and techniques to mitigate bias and handle unbalanced data effectively:
Concept of Bias in Machine Learning:
Challenges Associated with Unbalanced Datasets:
Methods and Techniques to Mitigate Bias and Handle Unbalanced Data:
Resampling Techniques:
Synthetic Data Generation:
Algorithm Selection:
Cost-sensitive Learning:
Ensemble Methods:
Evaluation Metrics:
Bias Detection and Fairness Evaluation:
Collect More Data:
Mitigating bias and handling unbalanced data is an essential aspect of responsible and effective machine learning. The choice of technique depends on the specific problem and dataset, and a combination of methods is often used to achieve the best results while ensuring fairness and accuracy.
Returning to the table of averages above: we can see that both genders are rated almost equally and have a nearly equal number of fans. It is also true that part of the large imbalance in the data set may come down to data quality; many authors had no gender assigned or used other gender labels, so the real bias is likely smaller.
As ChatGPT says, this is one of the most common and most difficult problems to solve. The data we collect is nothing more than a sample of society, so we must consider the criteria used to collect the samples, since a poor choice can lead to misleading conclusions and to models with an important bias. Ideally the labeling and data collection process would involve expert advice, together with appropriate techniques to prevent class imbalance.
Before going into the details of how we solved the problem, we want to state our expectation: for books in a series we expect a smaller time gap between two subsequent publications, because an author very likely works on multiple books of a series simultaneously, or starts the next one almost immediately after releasing the current one, compared to two books that do not belong to a series.
First we extract the top 10 authors by number of fans by ordering the authors table by the fans_count column in descending order and selecting only the first 10 rows. After that we inner join with the book_to_author and books tables, making sure we only select books with a valid original_publication_date, and create a rank variable partitioned by author_id and ordered by original_publication_date in ascending order. In addition, we create a series_ind with the value 1 if a book belongs to a series and 0 otherwise.
# Query to get a list with the top 10 authors concerning the number of fans, splitting the books in series in
# non-series and ranking by the original publication date in ascending order for every author
query = """
select c.author_id,
date(c.original_publication_date),
case when c.series_id > 0 then 1 else 0 end as series_ind,
row_number() over(partition by c.author_id order by date(c.original_publication_date) asc) rank
from book_to_author b
inner join books c
on b.book_id = c.id and c.author_id in (select id from authors order by fans_count desc limit 10)
and b.author_id in (select id from authors order by fans_count desc limit 10)
order by c.author_id
"""
# Execute query and fetch results
cursor.execute(query)
output = cursor.fetchall()
# Save output in a pandas Dataframe
df = pd.DataFrame(output, columns=["author_id", "original_publication_date", "series_ind", "rank"])
# Convert the original_publication_date column to the date format
df["original_publication_date"] = pd.to_datetime(df["original_publication_date"], format="%Y-%m-%d")
df
| | author_id | original_publication_date | series_ind | rank |
|---|---|---|---|---|
| 0 | 3389 | 1974-04-05 | 0 | 1 |
| 1 | 3389 | 1975-10-17 | 0 | 2 |
| 2 | 3389 | 1977-01-28 | 1 | 3 |
| 3 | 3389 | 1978-01-01 | 0 | 4 |
| 4 | 3389 | 1978-10-03 | 0 | 5 |
| ... | ... | ... | ... | ... |
| 381 | 5430144 | 2014-08-05 | 0 | 6 |
| 382 | 5430144 | 2014-11-24 | 1 | 7 |
| 383 | 5430144 | 2015-05-17 | 1 | 8 |
| 384 | 5430144 | 2015-11-10 | 0 | 9 |
| 385 | 5430144 | 2016-01-19 | 1 | 10 |
386 rows × 4 columns
In the next step we split the dataframe into two groups using series_ind: one containing only series books and one with all the others. Finally we calculate the time gap between subsequent publications for each author within each group.
# Split by series_ind in two groups
non_series_df = df[df["series_ind"] == 0].copy()
series_df = df[df["series_ind"] == 1].copy()
# Calculating the time gap between subsequent publications for each author for non_series_df
# Add 0 for the very first book
non_series_diff = [0]
for i in range(0, len(non_series_df)-1):
# Check if the next book belongs to the current author if yes, calculate gap
if non_series_df.iloc[i, 0] == non_series_df.iloc[i+1, 0]:
non_series_diff.append((non_series_df.iloc[i+1, 1] - non_series_df.iloc[i, 1]).days)
else:
non_series_diff.append(0)
# Calculating the time gap between subsequent publications for each author for series_df
# Add 0 for the very first book
series_diff = [0]
for i in range(0, len(series_df)-1):
# Check if the next book belongs to the current author if yes, calculate gap
if series_df.iloc[i, 0] == series_df.iloc[i+1, 0]:
series_diff.append((series_df.iloc[i+1, 1] - series_df.iloc[i, 1]).days)
else:
series_diff.append(0)
# Add both lists to the dataframes
non_series_df["date_diff"] = non_series_diff
series_df["date_diff"] = series_diff
# Remove all entries with a time gap of 0.
# Case 1: It is the very first book, so no gap to a previous book can be calculated.
# Case 2: Only the year or month/year of the publication date is known and both books fall in the
#         same period. We exclude those values so as not to wrongly skew our data, because it is
#         nearly impossible that an author released two books at exactly the same time.
non_series_df = non_series_df[non_series_df["date_diff"] != 0]
series_df = series_df[series_df["date_diff"] != 0]
print(f"# of series books: {len(series_df)}")
print(f"# of non-series books: {len(non_series_df)}")
print(f"Sum of time gaps between subsequent publications for all series books: {series_df['date_diff'].sum()}")
print(f"Sum of time gaps between subsequent publications for all non-series books: {non_series_df['date_diff'].sum()}")
# of series books: 212
# of non-series books: 125
Sum of time gaps between subsequent publications for all series books: 37396
Sum of time gaps between subsequent publications for all non-series books: 41483
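The explicit loops above can also be expressed with pandas' groupby/diff. This is a minimal sketch on toy data (the dataframe here is illustrative, not the actual query result); grouping by both author_id and series_ind reproduces the per-group consecutive gaps:

```python
import pandas as pd

# Toy data shaped like the query result above (author, date, series flag)
df = pd.DataFrame({
    "author_id": [1, 1, 1, 2, 2],
    "original_publication_date": pd.to_datetime(
        ["1974-04-05", "1975-10-17", "1977-01-28", "2014-08-05", "2014-11-24"]
    ),
    "series_ind": [0, 0, 1, 0, 1],
})

# Gap in days to the previous publication of the same author within the
# same group (series vs. non-series); the first book of each group is NaN
df["date_diff"] = (
    df.groupby(["author_id", "series_ind"])["original_publication_date"]
      .diff()
      .dt.days
)

# Drop the first book of every group, mirroring the 0-gap filter above
df = df.dropna(subset=["date_diff"])
print(df)
```

This avoids the positional iloc indexing and handles author boundaries automatically.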
We are going to calculate the average gap of both groups the following way:

$$ \text{Average time gap for series books} = \frac{\text{Sum of time gaps between subsequent publications for all series books}}{\text{\# of series books}} $$

$$ \text{Average time gap for non-series books} = \frac{\text{Sum of time gaps between subsequent publications for all non-series books}}{\text{\# of non-series books}} $$

Using the numbers we acquired from the previous step we get:

\begin{align} \text{Average time gap for series books} &= \frac{37396 \text{ days}}{212} \approx 176.4 \text{ days} \\ \text{Average time gap for non-series books} &= \frac{41483 \text{ days}}{125} \approx 331.9 \text{ days} \end{align}

Conclusion: The result matches our initial expectation that the time gap between two subsequent publications is smaller for series books than for non-series books.
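The averages can be reproduced directly from the sums and counts printed above; this is pure arithmetic with the numbers taken from the previous cell:

```python
# Sums of gaps (in days) and book counts from the previous step
series_gap_sum, series_n = 37396, 212
non_series_gap_sum, non_series_n = 41483, 125

avg_series = series_gap_sum / series_n              # average gap, series books
avg_non_series = non_series_gap_sum / non_series_n  # average gap, non-series books
print(f"series: {avg_series:.1f} days, non-series: {avg_non_series:.1f} days")
```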
For this we are going to use a similar query as for the previous question, but we include a year variable to filter all books published up to the given year.
# Year up to which published books should be filtered
year = 2020
# Query to get a list with the top 10 authors and the count of books for every year
query = f"""
with tmp as
(
select c.author_id,
substr(c.original_publication_date, 1, 4) as year,
count(*) as count_books
from book_to_author b
inner join books c
on b.book_id = c.id
and c.author_id in (select id from authors order by fans_count desc limit 10)
and b.author_id in (select id from authors order by fans_count desc limit 10)
group by c.author_id, year
)
select a.author_id,
b.name,
a.year,
a.count_books
from tmp as a
inner join authors b
on a.author_id = b.id
"""
# Execute query and fetch results
cursor.execute(query)
output = cursor.fetchall()
# Save output in a pandas Dataframe
df = pd.DataFrame(output, columns=["Author ID", "Author", "Year", "Count of Books"])
# Change datatype of year column to int
df["Year"] = df["Year"].astype(int)
# Filter up to the given year
df = df[df["Year"] <= year]
# Plotting results in a Line Plot
fig = px.line(df,
x='Year',
y='Count of Books',
color='Author',
title='Cumulative Count of Books per Author and Year')
fig.show("notebook")
Analysis: The authors appear to be contemporaries of one another. We can see noticeable common trends among the authors in 2009 and around 2013.
For this we are going to extract all books that have more than 30% of their total ratings in the 5-star and 4-star bins, count them, and divide by the total number of books:
We can easily use our books table for that: we calculate the 4- and 5-star ratings ratio using the rating_dist_5, rating_dist_4, and ratings_count columns and define a classifier that returns 1 if the ratio is larger than 30% and 0 otherwise. Summing this new field gives the total number of books with more than 30% 4- and 5-star ratings. In addition, we count all book ids to get the total number of books.
# Query to get # of books with > 30% 5-star and 4-star ratings and total # of books
query = """
select sum(case when (a.rating_dist_5 + a.rating_dist_4) * 1.0 / a.ratings_count > 0.3 then 1 else 0 end) as count_over_30,
count(a.id) as total_count
from books a
inner join book_to_author b
on a.id = b.book_id and b.author_id = a.author_id
inner join authors c
on c.id = b.author_id
"""
# Execute query and fetch results
cursor.execute(query)
output = cursor.fetchall()
# Save output in a pandas Dataframe
result_df = pd.DataFrame(output, columns=['count_over_30', 'total_count'])
# Calculate probability
result_df['probability'] = (result_df['count_over_30'] /
result_df['total_count']*100).round(2).astype(str) + '%'
print(result_df.to_string(index=False))
count_over_30 total_count probability
104664 780477 13.41%
Answer: The probability that a book has more than 30% of its ratings at 4 stars or above is 13.41%.
For this we are going to extract all authors who published a new book within two years of their last work and divide by the total number of authors:
We inner join our mapping table book_to_author with the books table and create a rank variable partitioned by author_id and ordered by original_publication_date in descending order. With this logic the newest book of every author is ranked 1 and its predecessor 2. In the next step we extract only the entries with rank 1 or 2 and calculate the difference in seconds between both dates. After that we divide the result by 86400 to convert it into days. Finally we define a classifier that returns 1 if the difference is <= 730 days and 0 otherwise. Summing this new field gives the total number of authors who published a new book within two years of their last work.
Note: We do not account for leap years and use 365 days for every year.
# Query to get # of authors with new book within two years from their last work
query1 = """
with tmp as
(
select a.author_id,
date(a.original_publication_date) as original_publication_date,
row_number() over(partition by a.author_id order by date(a.original_publication_date) desc) rank
from books a
inner join book_to_author b
on a.id = b.book_id and b.author_id = a.author_id
inner join authors c
on c.id = b.author_id
)
select sum(case when (strftime('%s', a.original_publication_date) - strftime('%s', b.original_publication_date)) / 86400 <= 730 then 1 else 0 end) as result
from (select author_id, original_publication_date from tmp where rank = 1) a
inner join (select author_id, original_publication_date from tmp where rank = 2) b
on a.author_id = b.author_id
"""
# Execute query and fetch results
cursor.execute(query1)
output = cursor.fetchall()
# Save output in a pandas Dataframe
result_df1 = pd.DataFrame(output, columns=['count_new_book_within_two_years'])
# Query to get total # of authors
query2 = """
select count(distinct c.id)
from books a
inner join book_to_author b
on a.id = b.book_id and b.author_id = a.author_id
inner join authors c
on c.id = b.author_id
"""
# Execute query and fetch results
cursor.execute(query2)
output = cursor.fetchall()
# Save output in a pandas Dataframe
result_df2 = pd.DataFrame(output, columns=['total_count'])
# Concat both Dataframes to one
result_df = pd.concat([result_df1, result_df2], axis=1)
# Calculate probability
result_df['probability'] = (result_df['count_new_book_within_two_years'] /
result_df['total_count']*100).round(2).astype(str) + '%'
print(result_df.to_string(index=False))
count_new_book_within_two_years total_count probability
63619 178076 35.73%
Answer: The probability that an author publishes a new book within two years of their last work is 35.73%.
For this we are going to extract all books that are included in the worst-books list and have more than 700 pages, and divide by the total number of books with more than 700 pages:
To get the number of worst books with more than 700 pages we inner join our books table with the worst_books_list table and select only the entries with more than 700 pages. To get the denominator we count all books with more than 700 pages.
# Query to get # of worst books with more than 700 pages
query1 = """
select count(*)
from worst_books_list a
inner join books b
on a.book_id = b.id
inner join book_to_author c
on b.id = c.book_id and c.author_id = b.author_id
inner join authors d
on d.id = c.author_id
where b.num_pages > 700
"""
# Execute query and fetch results
cursor.execute(query1)
output = cursor.fetchall()
# Save output in a pandas Dataframe
result_df1 = pd.DataFrame(output, columns=['count_worst_books_with_more_than_700_pages'])
# Query to get total # of books with more than 700 pages
query2 = """
select count(*)
from books a
inner join book_to_author b
on a.id = b.book_id and b.author_id = a.author_id
inner join authors c
on c.id = b.author_id
where num_pages > 700
"""
# Execute query and fetch results
cursor.execute(query2)
output = cursor.fetchall()
# Save output in a pandas Dataframe
result_df2 = pd.DataFrame(output, columns=['total_count_with_more_than_700_pages'])
# Concat both Dataframes to one
result_df = pd.concat([result_df1, result_df2], axis=1)
# Calculate probability
result_df['probability'] = (result_df['count_worst_books_with_more_than_700_pages'] /
result_df['total_count_with_more_than_700_pages']*100).round(2).astype(str) + '%'
print(result_df.to_string(index=False))
count_worst_books_with_more_than_700_pages total_count_with_more_than_700_pages probability
99 16401 0.6%
Answer: The probability that a book is included in the list of worst books, knowing it has more than 700 pages, is 0.6%.
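The 0.6% figure is simply the conditional probability P(worst | pages > 700) = #(worst and >700 pages) / #(>700 pages); a one-line check using the two counts from the output above:

```python
worst_and_long = 99   # worst-list books with more than 700 pages
long_books = 16401    # all books with more than 700 pages

# P(worst | pages > 700)
p = worst_and_long / long_books
print(f"{p * 100:.2f}%")  # 0.60%
```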
In the following queries we are going to calculate a few numbers we need to solve this problem.
# Query to get total # of worst books
query = """
select count(*)
from worst_books_list
"""
# Execute query and fetch results
cursor.execute(query)
output = cursor.fetchall()
# Save output in a pandas Dataframe
result_df = pd.DataFrame(output, columns=["total_count_worst_books"])
print(result_df.to_string(index=False))
total_count_worst_books
7393
# Query to get # of worst books with less than 700 pages
query = """
select count(*)
from worst_books_list a
inner join books b
on a.book_id = b.id
inner join book_to_author c
on b.id = c.book_id and c.author_id = b.author_id
inner join authors d
on d.id = c.author_id
where b.num_pages <= 700
"""
# Execute query and fetch results
cursor.execute(query)
output = cursor.fetchall()
# Save output in a pandas Dataframe
result_df = pd.DataFrame(output, columns=["total_count_worst_books_with_less_than_700_pages"])
print(result_df.to_string(index=False))
total_count_worst_books_with_less_than_700_pages
2938
# Query to get # of books with less than 700 pages and not in worst books list
query = """
select count(*)
from books a
inner join book_to_author b
on a.id = b.book_id and b.author_id = a.author_id
inner join authors c
on c.id = b.author_id
left join worst_books_list d
on a.id = d.book_id
where num_pages <= 700
and d.title is null
"""
# Execute query and fetch results
cursor.execute(query)
output = cursor.fetchall()
# Save output in a pandas Dataframe
result_df = pd.DataFrame(output, columns=["total_count_not_worst_books_with_less_than_700_pages"])
print(result_df.to_string(index=False))
total_count_not_worst_books_with_less_than_700_pages
761138
# Query to get # of books with more than 700 pages and not in worst books list
query = """
select count(*)
from books a
inner join book_to_author b
on a.id = b.book_id and b.author_id = a.author_id
inner join authors c
on c.id = b.author_id
left join worst_books_list d
on a.id = d.book_id
where num_pages > 700
and d.title is null
"""
# Execute query and fetch results
cursor.execute(query)
output = cursor.fetchall()
# Save output in a pandas Dataframe
result_df = pd.DataFrame(output, columns=["total_count_not_worst_books_with_more_than_700_pages"])
print(result_df.to_string(index=False))
total_count_not_worst_books_with_more_than_700_pages
16302
We are going to solve this problem in two ways.
# Create 2x2 contingency table
observed_values = np.array([[99, 2938], [16302, 761138]])
# Run a chi-squared test
chi2, p, dof, expected = chi2_contingency(observed_values)
# Format the p-value for better representation
p_str = "{:.8f}".format(p)
print(f"p: {p_str}")
# Check if the p-value is significant or not
if p < 0.05:
    print("Reject the null hypothesis: X and Y are not independent.")
else:
    print("Fail to reject the null hypothesis: X and Y are independent.")
p: 0.00001102
Reject the null hypothesis: X and Y are not independent.
Answer: Both approaches show that the events X = 'Being included in The Worst Books of All Time list' and Y = 'Having more than 700 pages' are not independent.
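The expected counts and the χ² statistic can also be checked by hand. A minimal sketch (note that `chi2_contingency` applies Yates' continuity correction to 2×2 tables by default, so we disable it here to match the plain textbook formula):

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[99, 2938], [16302, 761138]])

# Expected counts under independence: E_ij = row_total_i * col_total_j / grand_total
total = observed.sum()
expected_manual = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / total

# correction=False disables Yates' continuity correction so that the returned
# statistic matches the textbook chi-squared formula exactly
chi2, p, dof, expected = chi2_contingency(observed, correction=False)
manual_chi2 = ((observed - expected_manual) ** 2 / expected_manual).sum()
print(np.isclose(chi2, manual_chi2))  # → True
```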
# get the average number of pages of the books considered the worst ones
cursor.execute("""
SELECT AVG(b.num_pages)
FROM books b
JOIN worst_books_list w ON b.id = w.book_id
""")
average_wbl = cursor.fetchone()[0]
# get the average number of pages of the books not in the worst-books list
cursor.execute("""
SELECT AVG(b.num_pages)
FROM books b
LEFT JOIN worst_books_list w ON b.id = w.book_id
WHERE w.book_id IS NULL
""")
average_notwbl = cursor.fetchone()[0]
# get the average number of pages of all the books
cursor.execute("""
SELECT AVG(num_pages)
FROM books
""")
average_tot= cursor.fetchone()[0]
print("Average number of pages of the worst books: ", np.round(average_wbl, 2))
print("Average number of pages of the books not in the worst list: ", np.round(average_notwbl, 2))
print("Average number of pages of all the books: ", np.round(average_tot, 2))
Average number of pages of the worst books:  346.47
Average number of pages of the books not in the worst list:  262.7
Average number of pages of all the books:  262.84
As we can see, the average number of pages of the worst books (346.47) is noticeably higher than the overall average (262.84).
# query that extracts the average rating of the english books
cursor.execute("""
SELECT average_rating
FROM books
WHERE language = 'eng'
""")
ratings_eng = [row[0] for row in cursor.fetchall()]
# query that extracts the average rating of the non-english books
cursor.execute("""
SELECT average_rating
FROM books
WHERE language != 'eng' OR language IS NULL
""")
ratings_non_eng = [row[0] for row in cursor.fetchall()]
# we get the p-value using a t-test
t, p_value = ttest_ind(ratings_eng, ratings_non_eng)
# H0: no difference
# H1: there's a difference
# print p-value
print(f"P-value: {p_value:.4f}")
P-value: 0.0000
In the null hypothesis we assume that the average ratings of the two groups show no significant difference; in the alternative hypothesis we suggest that there is in fact a notable difference between the English and non-English ratings.
The p-value is lower than the significance level $\alpha$ (= 0.05), so we reject the null hypothesis and can affirm that the average rating of books written in English differs from that of books not written in English.
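`ttest_ind` assumes equal variances in the two groups by default; since the non-English ratings are visibly more spread out, Welch's variant (`equal_var=False`) is a useful robustness check. A sketch on synthetic stand-in samples (the means, spreads, and sample sizes below are illustrative, not taken from the database):

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic stand-ins for the two rating samples; the second group is
# deliberately given a much larger spread, as observed in the real data
rng = np.random.default_rng(0)
ratings_eng = rng.normal(3.77, 0.5, 1000)
ratings_non_eng = rng.normal(2.90, 1.5, 500)

t_student, p_student = ttest_ind(ratings_eng, ratings_non_eng)               # assumes equal variances
t_welch, p_welch = ttest_ind(ratings_eng, ratings_non_eng, equal_var=False)  # Welch's t-test
print(p_student < 0.05, p_welch < 0.05)  # → True True
```

If both versions agree, the conclusion is robust to the equal-variance assumption.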
# mean, mode, median, q1 and q3
# for English-Books
mean_e= np.mean(ratings_eng)
mode_e = statistics.mode(ratings_eng)
med_e= np.median(ratings_eng)
q1_e,q3_e=np.percentile(ratings_eng, [25, 75])
# for non English-Books
mean_ne= np.mean(ratings_non_eng)
mode_ne = statistics.mode(ratings_non_eng)
med_ne= np.median(ratings_non_eng)
q1_ne,q3_ne=np.percentile(ratings_non_eng, [25, 75])
# printing the results
print(" ENG | NON-ENG")
print("MEAN: ", np.round(mean_e,2)," | ",np.round(mean_ne,2))
print("MODE: ",mode_e," | ",mode_ne)
print("MEDIAN: ",med_e," | ",med_ne)
print("Q1: ", np.round(q1_e,2)," | ",np.round(q1_ne,2))
print("Q3: ", np.round(q3_e,2)," | ",np.round(q3_ne,2))
# plotting boxplot
plt.boxplot([ratings_eng, ratings_non_eng], labels=['English', 'Non-English'])
plt.ylabel('Average Rating')
plt.title('Comparison between the Average Rating of English and non-English books')
plt.show()
         ENG  |  NON-ENG
MEAN:    3.77 |  2.9
MODE:    4.0  |  0.0
MEDIAN:  3.93 |  3.71
Q1:      3.65 |  0.0
Q3:      4.17 |  4.05
The comparison of the two means supports the alternative hypothesis, indicating that English-written books have higher ratings. At every quartile position (the lower side of the box, the median line, and the upper side of the box), English books exhibit higher values than non-English ones. In the boxplot the interquartile range is larger for the non-English books, suggesting greater variability in their ratings. The non-English ratings have a mode of 0 and a first quartile of 0, implying that a significant number of those books lack a rating.
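Since unrated books (rating 0.0) pull the non-English quartiles down, one way to compare only rated books is to drop the zeros before computing the summary statistics. A small sketch (the sample list is hypothetical):

```python
import numpy as np

def summarize(ratings):
    """Quartiles and median after dropping unrated (0.0) entries."""
    rated = [r for r in ratings if r > 0]
    q1, med, q3 = np.percentile(rated, [25, 50, 75])
    return q1, med, q3

# Hypothetical sample containing two unrated books
sample = [0.0, 0.0, 3.2, 3.8, 4.1, 4.5]
print(summarize(sample))
```

Applied to `ratings_non_eng`, this would show whether the low non-English quartiles are driven mainly by missing ratings.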
# SQL query
query = """
SELECT fans_count, text_reviews_count
FROM authors
"""
cursor.execute(query)
data = np.array(cursor.fetchall())
X = data[:, 0].reshape(-1, 1) # fans_count
y = data[:, 1] # text_reviews_count
# Linear Regression
coeff = np.polyfit(X.flatten(), y, 1)
reg = np.poly1d(coeff)
# Calculating R^2
corr_m = np.corrcoef(X.flatten(), y) # correlation matrix
corr = corr_m[0, 1]
r_sq= corr ** 2
# Printing R^2
print(f'R-squared: {r_sq:.4f}')
# Plotting
plt.scatter(X, y, color='blue')
plt.plot(X, reg(X.flatten()), color='red', linewidth=3)
plt.xlabel('Fans Count')
plt.ylabel('Text Reviews Count')
plt.title('Fans Count vs Text Reviews Count')
plt.show()
R-squared: 0.4820
As the regression line shows, there is a positive correlation between the fan count and the text reviews count: the more fans an author has, the more reviews they tend to receive. However, this is not always true; some popular authors have very few reviews. An R-squared of about 0.48 means roughly half of the variability in review counts is explained by fan counts, so the fit is moderate rather than strong. For a more detailed study, we could consider other factors such as the genre of the books written and their sales.
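As a sanity check on the R² computation above: for a simple (degree-1) least-squares fit, the squared Pearson correlation equals the classical 1 − SS_res/SS_tot definition. A sketch with synthetic data (the slope, noise level, and sample size are arbitrary):

```python
import numpy as np

# Synthetic linear data with noise
rng = np.random.default_rng(1)
x = rng.uniform(0, 100, 200)
y = 0.5 * x + rng.normal(0, 5, 200)

slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept

# For simple linear regression, corr(x, y)^2 == 1 - SS_res / SS_tot
r_sq_corr = np.corrcoef(x, y)[0, 1] ** 2
r_sq_ss = 1 - ((y - y_hat) ** 2).sum() / ((y - y.mean()) ** 2).sum()
print(np.isclose(r_sq_corr, r_sq_ss))  # → True
```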
Hypothesis design involves formulating a statement about the population for testing, typically in the form of a null hypothesis and an alternative hypothesis. The null hypothesis asserts that the sample characteristics do not differ from the population.
If we assume that the data distribution is normal, we may opt for a parametric test, which is considered more reliable. On the other hand, a nonparametric test is employed for any type of data without making assumptions about its distribution.
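For example, when the data are skewed and normality is doubtful, the rank-based Mann-Whitney U test is a common nonparametric counterpart of the two-sample t-test. A sketch on synthetic skewed samples (distributions and sizes are illustrative):

```python
import numpy as np
from scipy.stats import ttest_ind, mannwhitneyu

# Hypothetical right-skewed samples where the normality assumption is doubtful
rng = np.random.default_rng(0)
a = rng.exponential(2.0, 300)
b = rng.exponential(2.5, 300)

t, p_param = ttest_ind(a, b)        # parametric: assumes normally distributed groups
u, p_nonparam = mannwhitneyu(a, b)  # nonparametric: compares ranks, no distribution assumed
print(round(p_param, 4), round(p_nonparam, 4))
```

When the two tests disagree on skewed data, the nonparametric result is usually the safer choice.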
We decided to compare performance between Pandas and Spark.
start_time = time.time()
# Read authors file into a Pandas dataframe
pandas_df_authors = pd.read_json(authors_file, lines=True)
# Filter only for authors with at least 100 reviews
pandas_df_authors = pandas_df_authors[pandas_df_authors["text_reviews_count"] >= 100]
end_time = time.time()
elapsed_time = end_time - start_time
print(f"Execution time with Pandas: {elapsed_time:.2f} seconds")
Execution time with Pandas: 17.53 seconds
start_time = time.time()
# Load authors file into a Spark Dataframe
spark_df_authors = spark.read.json(authors_file)
# Filter only for authors with at least 100 reviews
spark_df_authors = spark_df_authors.filter(spark_df_authors["text_reviews_count"] >= 100)
end_time = time.time()
elapsed_time = end_time - start_time
print(f"Execution time with Spark: {elapsed_time:.2f} seconds")
Execution time with Spark: 2.07 seconds
Conclusion: For this read-and-filter task, Spark finished almost 10 times faster than Pandas! (Note that Spark transformations such as filter are lazy, so part of the filtering work may be deferred until an action materializes the result.)
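Spark's lazy evaluation matters when interpreting such timings: building the pipeline is cheap, and the cost is paid when results are materialized by an action. A pure-Python analogy of the same effect using generators:

```python
import time

# Analogy of Spark's lazy transformations: a generator expression builds a
# pipeline without examining any element; iteration (the "action") does the work
data = range(10_000_000)

start = time.time()
filtered = (x for x in data if x % 2 == 0)  # no element is examined yet
build_time = time.time() - start

start = time.time()
n = sum(1 for _ in filtered)                # the filtering happens here
run_time = time.time() - start

print(n, build_time < run_time)  # → 5000000 True
```

A fair Spark-vs-Pandas benchmark should therefore end with an action such as `.count()` so the full computation is included in the measured time.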
We stuck with Spark.
start_time = time.time()
# Load books file into a Spark Dataframe
spark_df_books = spark.read.json(books_file)
end_time = time.time()
elapsed_time = end_time - start_time
print(f"Execution time with Spark: {elapsed_time:.2f} seconds")
Execution time with Spark: 54.17 seconds
# Left join the authors dataframe on the books dataframe based on the author_id
joined_df = spark_df_books.join(spark_df_authors,
spark_df_books["author_id"] == spark_df_authors["id"],
how="left")
# Filter for all books that have no matching author
filtered_df = joined_df.filter(joined_df["name"].isNull())
print(f"Books without a match for an author: {filtered_df.count():,}")
Books without a match for an author: 2,240,774
Answer: In total, 2,240,774 books have no matching author.
We created a genre dictionary where each key represents a genre and holds a list of keywords we defined beforehand. After analyzing the about field we realized that the genre is very often mentioned directly in the text, so we included the genre name itself in the keyword list. These lists and the dictionary can of course be extended at any time.
# Genre dictionary
genre_dict = {"Fantasy": ["Fantasy"],
"Science Fiction": ["Science Fiction"],
"Dystopian": ["Dystopian"],
"Action & Adventure": ["Action", "Adventure"],
"Mystery": ["Mystery"],
"Horror": ["Horror"],
"Thriller & Suspense": ["Thriller", "Suspense"],
"Historical Fiction": ["Historical"],
"Romance": ["Romance"],
"Women’s Fiction": ["Women"],
"LGBTQ+": ["LGBT"],
"Contemporary Fiction": ["Contemporary"],
"Literary Fiction": ["Literary"],
"Magical Realism": ["Magical"],
"Graphic Novel": ["Graphic Novel"],
"Short Story": ["Short Story"],
"Young Adult": ["Young Adult"],
"New Adult": ["New Adult"],
"Children": ["Children"],
"Memoir & Autobiography": ["Memoir", "Autobiography"],
"Biography": ["Biography"],
"Food & Drink": ["Food", "Drink"],
"Art & Photography": ["Art", "Photography"],
"Self-help": ["Self-help", "Selfhelp"],
"History": ["History"],
"Travel": ["Travel"],
"True Crime": ["True Crime"],
"Humor": ["Humor"],
"Essay": ["Essay"],
"Guide & How-to": ["Guide", "How-to"],
"Religion & Spirituality": ["Religion", "Spirituality"],
"Humanities & Social Sciences": ["Humanities", "Social Sciences"],
"Parenting & Families": ["Families", "Family", "Parenting"],
"Science & Technology": ["Science", "Technology"],
"English": ["English"],
"German": ["German"],
"Italian": ["Italian"],
"Spanish": ["Spanish"],
"Japanese": ["Japanese"],
"French": ["French"],
"Arabian": ["Arabian"]}
We created a function that takes a given text and a predefined genre dictionary and returns a list of all matching genre keys.
def map_genres(text: str, genre_dict: Dict) -> List:
    """
    Function that takes a given text, searches it for every keyword inside the genre
    dictionary, and returns a list with the matching keys, or ['Unknown'] otherwise.
    Args:
        text (str): Given text to be searched.
        genre_dict (Dict): Dictionary with all genres and their keywords.
    Returns:
        found_genres (List): List with all found genres.
    """
    found_genres = []
    # Search for every keyword of each genre inside the text and collect the matching keys
    for key, values in genre_dict.items():
        for value in values:
            if value in text:
                found_genres.append(key)
                break  # report each genre at most once, even if several keywords match
    # Insert 'Unknown' otherwise
    if not found_genres:
        found_genres.append("Unknown")
    return found_genres
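A quick usage check of the mapping logic with a condensed two-genre toy dictionary (the texts are made up; this compact variant reports each genre at most once):

```python
# Toy dictionary and condensed restatement of the genre-mapping logic
genre_dict = {"Fantasy": ["Fantasy"],
              "Thriller & Suspense": ["Thriller", "Suspense"]}

def map_genres(text, genre_dict):
    # Collect every genre key for which at least one keyword occurs in the text
    found = [key for key, values in genre_dict.items()
             if any(value in text for value in values)]
    return found or ["Unknown"]

print(map_genres("An author of Fantasy and Thriller novels.", genre_dict))
# → ['Fantasy', 'Thriller & Suspense']
print(map_genres("A poet from Rome.", genre_dict))
# → ['Unknown']
```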
To analyze all authors we read them in a query into a Pandas Dataframe. After that we apply our function on the about column and save the results in a new column genres. At the end we explode the genres column and transform the Pandas Dataframe into a Spark Dataframe.
# Query to get all authors and their about description
query = """
select id,
name,
about
from authors
"""
# Execute query and fetch results
cursor.execute(query)
output = cursor.fetchall()
# Save output in a pandas Dataframe
result_authors_df = pd.DataFrame(output, columns=["id", "name", "about"])
# Apply the function to create the "genre" column
result_authors_df["genres"] = result_authors_df["about"].apply(lambda x: map_genres(x, genre_dict))
result_authors_df = result_authors_df.explode("genres")
result_authors_df = result_authors_df[["id", "genres"]]
# Transform Pandas Dataframe into a Spark Dataframe
spark_df_authors = spark.createDataFrame(result_authors_df)
# Rename columns
spark_df_authors = spark_df_authors.withColumnRenamed("id", "author_id")
spark_df_authors = spark_df_authors.withColumnRenamed("genres", "author_genre")
spark_df_authors.show(10)
+---------+--------------------+
|author_id|        author_genre|
+---------+--------------------+
|        4|      Guide & How-to|
|        4|             English|
|        7|             History|
|        7|Science & Technology|
|        7|             English|
|       10|             Unknown|
|       12|             Unknown|
|       14|             English|
|       16|              French|
|       17|             English|
+---------+--------------------+
only showing top 10 rows
For analyzing the description field in books we used the same genre dictionary as before, but had to slightly change our function, because the books file was loaded into a Spark Dataframe due to its larger size.
def map_genres_in_spark(text: str) -> List:
    """
    Function that takes a given text, searches it for the genre dictionary keys,
    and returns a list with the matching keys, or ['Unknown'] otherwise.
    Args:
        text (str): Given text to be searched.
    Returns:
        found_genres (List): List with all found genres.
    """
    # Split text into single words
    words = text.split()
    # Collect every word that matches a genre key
    found_genres = [word for word in words if word in genre_dict]
    # Insert 'Unknown' otherwise
    if not found_genres:
        found_genres = ["Unknown"]
    return found_genres
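Note that this variant splits the description on whitespace and compares single words against the dictionary keys, so multi-word genres such as "Science Fiction" or "Young Adult" can never match, which likely contributes to the many "Unknown" rows in the output. A hedged sketch of a substring-based alternative (toy dictionary and made-up text, same approach as map_genres above):

```python
from typing import List

# Toy subset of the genre dictionary for illustration
genre_dict = {"Science Fiction": ["Science Fiction"],
              "Young Adult": ["Young Adult"],
              "Horror": ["Horror"]}

def map_genres_substring(text: str) -> List[str]:
    """Substring match so that multi-word keywords can also be found."""
    found = [key for key, values in genre_dict.items()
             if any(value in text for value in values)]
    return found or ["Unknown"]

print(map_genres_substring("A Young Adult Science Fiction novel."))
# → ['Science Fiction', 'Young Adult']
```

Wrapped in a `udf` the same way as `map_genres_in_spark`, this would make the books-side mapping consistent with the authors-side mapping.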
Before applying our function we cleaned the books Spark Dataframe with the same logic we used to clean our database.
# Clean the data with the same logic we applied on our database
filtered_df = spark_df_books.filter(
(col("id") > 0) &
(col("author_id") > 0) &
(col("num_pages") > 0) &
(col("num_pages").isNotNull()) &
(expr("""
(publication_date LIKE '____-__-__'
AND date_format(publication_date, 'yyyy-MM-dd') = publication_date
AND date(publication_date) <= current_date())
""") == True) &
(expr("""
(original_publication_date LIKE '____-__-__'
AND date_format(original_publication_date, 'yyyy-MM-dd') = original_publication_date
AND date(original_publication_date) <= current_date())
""") == True)
)
# Create a user defined function to search for genres
find_matching_keys_udf = udf(map_genres_in_spark, ArrayType(StringType()))
# Apply the function to create the "genre" column
filtered_df = filtered_df.withColumn("genres", find_matching_keys_udf(filtered_df["description"]))
spark_df_books = filtered_df.select("id", "author_id", explode("genres").alias("book_genre"))
# Rename column
spark_df_books = spark_df_books.withColumnRenamed("id", "book_id")
spark_df_books.show(10)
+-------+---------+----------+
|book_id|author_id|book_genre|
+-------+---------+----------+
|      3|  1077326|   Unknown|
|      4|  1077326|   Unknown|
|      5|  1077326|   Unknown|
|      6|  1077326|   Unknown|
|      8|  1077326|   Unknown|
|     10|  1077326|   Unknown|
|     12|        4|   Unknown|
|     13|        4|   Unknown|
|     14|        4|   Unknown|
|     16|        4|   Unknown|
+-------+---------+----------+
only showing top 10 rows
We inner join both Spark Dataframes on author_id.
# Inner join both Dataframes by author_id
result = spark_df_books.join(spark_df_authors, "author_id", "inner")
result.show(10)
+---------+-------+----------+--------------------+
|author_id|book_id|book_genre|        author_genre|
+---------+-------+----------+--------------------+
|        7|     21|   History|             History|
|        7|     21|   History|Science & Technology|
|        7|     21|   History|             English|
|        7|     24|   Unknown|             History|
|        7|     24|   Unknown|Science & Technology|
|        7|     24|   Unknown|             English|
|        7|     25|   English|             History|
|        7|     25|   English|Science & Technology|
|        7|     25|   English|             English|
|        7|     26|   Unknown|             History|
+---------+-------+----------+--------------------+
only showing top 10 rows
After that we create a new column is_match which is 1 if we mapped the same genre as for the author and 0 otherwise.
# Define a new column is_match which is 1 if author_genre = book_genre and 0 otherwise
result = result.withColumn("is_match", when(col("book_genre") == col("author_genre"), 1).otherwise(0))
result.show(10)
+---------+-------+----------+--------------------+--------+
|author_id|book_id|book_genre|        author_genre|is_match|
+---------+-------+----------+--------------------+--------+
|        7|     21|   History|             History|       1|
|        7|     21|   History|Science & Technology|       0|
|        7|     21|   History|             English|       0|
|        7|     24|   Unknown|             History|       0|
|        7|     24|   Unknown|Science & Technology|       0|
|        7|     24|   Unknown|             English|       0|
|        7|     25|   English|             History|       0|
|        7|     25|   English|Science & Technology|       0|
|        7|     25|   English|             English|       1|
|        7|     26|   Unknown|             History|       0|
+---------+-------+----------+--------------------+--------+
only showing top 10 rows
Based on the previous result we group by author_id and create a new column has_match, which is 1 if at least one of the author's books was mapped to the same genre as the author.
# Group by author_id and define a new column has_match if at least one book has the same genre
result = result.groupBy("author_id").agg(when(sum(col("is_match")) > 0, 1).otherwise(0).alias("has_match"))
result.show(10)
+---------+---------+
|author_id|has_match|
+---------+---------+
|        7|        1|
|       26|        0|
|       29|        0|
|       50|        0|
|       54|        1|
|       57|        1|
|      113|        0|
|      130|        1|
|      136|        0|
|      149|        0|
+---------+---------+
only showing top 10 rows
Finally, we count the authors for whom at least one book was mapped to the same genre, and those with no match.
# Count the occurrences of 0 and 1
count_zeros = result.filter(col("has_match") == 0).count()
count_ones = result.filter(col("has_match") == 1).count()
print(f"Number of 0s: {count_zeros}")
print(f"Number of 1s: {count_ones}")
Number of 0s: 60885
Number of 1s: 133661
Conclusion: Applying our genre dictionary to both the authors and books files, for 133,661 authors we mapped at least one of their books to the same genre as the author, while for 60,885 authors we either couldn't determine a genre at all or the mappings were completely different from the books they wrote.
The shell script is saved in the file commandline_original.sh. First, navigate with the cd command to the directory containing the series.json file. Make the script executable with chmod +x commandline_original.sh, then run it with ./commandline_original.sh. As a result, the top 5 series with the highest total book count among all of their associated books are saved in a json file called sorted_series_with_books_count.json, and the result is printed in the terminal:
| id | title | total_books_count |
|---|---|---|
| 308815 | The Great Courses | 100 |
| 308303 | Lonely Planet - Travel Survival Kit | 100 |
| 308300 | Lonely Planet | 100 |
| 306917 | Comet Street Kids | 100 |
| 306039 | Contributions in Political Science | 100 |
After running this script we asked ChatGPT for a more robust solution with the following prompt:
is there a more robust solution for
jq '{id: .id, title: .title, worksLength: (.works | length)}' series.json > series_with_books_count.json
jq -s 'sort_by(.worksLength) | reverse | .[:5]' series_with_books_count.json > sorted_series_with_books_count.json
jq -r '["id", "title", "total_books_count"], (.[] | [.id, .title, .worksLength]) | @tsv' sorted_series_with_books_count.json | column -t -s $'\t'
We got the following answer:
The commands you've provided seem to be a series of operations using jq and column to manipulate and display data from a JSON file called series.json. This process appears to be designed to extract specific information from the JSON data, sort it, and format it as a table. However, there are ways to make this process more robust and efficient.
Here's an improved version of your commands:
jq -s 'map({id, title, worksLength: (.works | length)}) | sort_by(.worksLength) | reverse | .[:5]' series.json > sorted_series_with_books_count.json
jq -r '["id", "title", "total_books_count"], (.[] | [.id, .title, .worksLength]) | @tsv' sorted_series_with_books_count.json | column -t -s $'\t'
Here's what these changes do:
Use map instead of the [] array constructor in the first jq command to transform each item in the input JSON array into the desired format.
The jq command is now sorting and reversing the data in a single step, which can be more efficient.
The second jq command takes the sorted and transformed data and formats it as a tab-separated table.
Removed the intermediate step of creating a separate series_with_books_count.json file, which can save disk space and improve efficiency.
These changes should make the process more robust and easier to understand. It's important to note that the efficiency and robustness of these commands also depend on the size and complexity of your series.json file and your specific use case. If you have a large JSON file or need more complex transformations, further optimization may be necessary.
This result was saved inside the new commandline_LLM.sh file. Executing it we get the following result:
| id | title | total_books_count |
|---|---|---|
| 308815 | The Great Courses | 100 |
| 308303 | Lonely Planet - Travel Survival Kit | 100 |
| 308300 | Lonely Planet | 100 |
| 306917 | Comet Street Kids | 100 |
| 306039 | Contributions in Political Science | 100 |
Conclusion: Since both shell scripts produce the same output table, we can assume that the solution provided by ChatGPT is correct.
First, we are going to analyze the algorithm step by step:
- map parses every line of the json file and creates the fields id, title, and worksLength; this is linear in the total number of lines $n$, so it has a complexity of $O(n)$.
- Computing worksLength (the length of each works array) is also linear in the input, again $O(n)$.
- For jq's sort_by, the documentation (https://devdocs.io/jq/) does not state which algorithm is used, but our own research strongly suggests a merge sort, which runs in $O(n \log n)$.

Overall, the sorting operation is the dominant part, so the algorithm has a time complexity of $O(n \log n)$.
Now we check for correctness proving by induction:
Base case:
Our algorithm iterates through the entries of a json file, so the base case is a file with a single entry. The algorithm needs only one step to parse that line, since nothing has to be sorted; indeed, for $n=1$ the complexity is $O(1)$.
Inductive hypothesis:
We assume a recurrence relation with a time complexity of $T(k) ≤ c\cdot k\cdot\log(k)\text{ for all }k<n$.
Induction step:
The recurrence relation for a merge sort is typically $T(n) = 2T(n/2)+O(n)$.
Now we combine this relation with our inductive hypothesis and get
\begin{align} T(n) &= 2 \cdot T(n/2)+O(n) \\ &\le 2 \cdot (c\cdot(n/2)\cdot\log(n/2)) + O(n) \\ &= c \cdot n \cdot \log(n/2) + O(n) \\ &= c \cdot n \cdot (\log(n) - \log(2)) + O(n) \\ &= c \cdot n \cdot \log(n) - c \cdot n + O(n) \\ &\le c \cdot n \cdot \log(n) \quad \text{for } c \text{ chosen large enough.} \end{align}
Since $c$ is a constant we can choose it large enough to absorb the $O(n)$ term. So we end up with $T(n) = O(n \log n)$, which proves the claimed bound.
Below you can find the code for the report that provides the top 5 most frequently used tags, the number of times they appear in the lists, and its running time. We parse the json file line by line and, for every tag, either increment its counter entry if it already exists or create a new entry otherwise.
start_time = time.time()
# Create counter variable
tag_counts = Counter()
# Parse through the whole json file
with open(list_file, 'r') as f:
    for line in f:
        item = json.loads(line)
        # Increment the counter for each tag, skipping lines without a 'tags' field
        try:
            for tag in item['tags']:
                tag_counts[tag] += 1
        except KeyError:
            pass
# Extract only the top 5 tags
top_tags = tag_counts.most_common(5)
print(tabulate(top_tags, headers=['tag', '#usage'], tablefmt="psql"))
end_time = time.time()
elapsed_time = end_time - start_time
print(f"Script execution time: {elapsed_time:.2f} seconds")
+-----------------+----------+
| tag             |   #usage |
|-----------------+----------|
| romance         |     6001 |
| fiction         |     5291 |
| young-adult     |     5016 |
| fantasy         |     3666 |
| science-fiction |     2779 |
+-----------------+----------+
Script execution time: 20.04 seconds
In the next step we created an instance using the AWS EC2 service with the following setup:
API call: RunInstances
{
"MaxCount": 1,
"MinCount": 1,
"ImageId": "ami-0fc5d935ebf8bc3bc",
"InstanceType": "t2.micro",
"KeyName": "aws_key",
"EbsOptimized": false,
"BlockDeviceMappings": [
{
"DeviceName": "/dev/sda1",
"Ebs": {
"Encrypted": false,
"DeleteOnTermination": true,
"SnapshotId": "snap-0fe62e94bc2ecc9d5",
"VolumeSize": 30,
"VolumeType": "gp2"
}
}
],
"NetworkInterfaces": [
{
"AssociatePublicIpAddress": true,
"DeviceIndex": 0,
"Groups": [
"<groupId of the new security group created below>"
]
}
],
"TagSpecifications": [
{
"ResourceType": "instance",
"Tags": [
{
"Key": "Name",
"Value": "ubuntu-instance"
}
]
}
],
"PrivateDnsNameOptions": {
"HostnameType": "ip-name",
"EnableResourceNameDnsARecord": true,
"EnableResourceNameDnsAAAARecord": false
}
}
API call: CreateSecurityGroup
{
"GroupName": "launch-wizard-1",
"Description": "launch-wizard-1 created 2023-10-28T12:42:23.590Z",
"VpcId": "vpc-0cbb10add1a4e6a4b"
}
API call: AuthorizeSecurityGroupIngress
{
"GroupId": "<groupId of the security group created above>",
"IpPermissions": [
{
"IpProtocol": "tcp",
"FromPort": 22,
"ToPort": 22,
"IpRanges": [
{
"CidrIp": "0.0.0.0/0"
}
]
}
]
}
| Setting | Value |
|---|---|
| Number of instances | 1 |
| Software Image (AMI) | Canonical, Ubuntu, 22.04 LTS, amd64 jammy image build on 2023-09-19 |
| Virtual server type (instance type) | t2.micro |
| Firewall (security group) | New security group |
| Storage (volumes) | 1 volume(s) - 30 GiB |
After setting up the instance we connected via a terminal with the following command:
ssh -i "/Users/andre/Data Science/Semester_1/ADM/HW2/aws_key.pem" ubuntu@ec2-184-72-66-207.compute-1.amazonaws.com
After connecting to the instance we uploaded the list.json file and our Python script:
scp -i "/Users/andre/Data Science/Semester_1/ADM/HW2/aws_key.pem" "/Users/andre/Data Science/Semester_1/ADM/HW2/files/list.json" ubuntu@ec2-184-72-66-207.compute-1.amazonaws.com:/home/ubuntu/
scp -i "/Users/andre/Data Science/Semester_1/ADM/HW2/aws_key.pem" "/Users/andre/Data Science/Semester_1/ADM/HW2/awsq_script.py" ubuntu@ec2-184-72-66-207.compute-1.amazonaws.com:/home/ubuntu/
Before running the script we had to do the following steps:
sudo apt update
sudo apt install python3-pip
pip install tabulate
Now we were able to run our script with:
python3 awsq_script.py
Output:
+-----------------+----------+
| tag | #usage |
|-----------------+----------|
| romance | 6001 |
| fiction | 5291 |
| young-adult | 5016 |
| fantasy | 3666 |
| science-fiction | 2779 |
+-----------------+----------+
Script execution time: 20.68 seconds
Comment: As we can see, the execution time on the AWS EC2 instance is higher than on our local system. A possible explanation is that we are using a comparatively weak instance with only one core and 1 GB of memory, compared to our local system with 4 cores and 8 GB of memory.
Assume you are working as a librarian at a public library in Rome. Some new books have arrived, and you are in charge of putting them on a shelf. Your supervisor will give you some instructions, and you must follow them. Each book has a unique ID, and your boss's instructions can be of the following types:
You must follow your boss's instructions and report the answers to type 3 instructions to him. He guarantees that if he has a type 3 instruction for a book with a specific ID, the book has already been placed on the shelf.
Remember that once you've answered a type 3 instruction, the order of the books does not change.
Input:
The first line contains a single number, n, the number of your boss's instructions. Each of the following n lines contains one instruction.
Output:
Print your boss's type 3 instructions in the order they appear in the input.
n = int(input())  # get the number of instructions
shelf = []        # start with an empty shelf
answers = ""
for _ in range(n):
    ins, book = input().split()  # read the boss's instruction
    if ins == 'L':
        shelf.insert(0, int(book))  # insert the book at the start of the shelf
    elif ins == 'R':
        shelf.append(int(book))     # insert the book at the end of the shelf
    else:
        book_id = int(book)
        left_pops = shelf.index(book_id)
        right_pops = len(shelf) - left_pops - 1  # pops needed from the right side
        answers += str(min(left_pops, right_pops)) + "\n"  # store the answer
print(answers)
8
L 75
R 20
R 30
L 11
? 75
L 12
L 15
? 20
1
1
The chatbot Bard suggested a time complexity of O(n log n), but its reasoning does not hold for Python lists: shelf is not a sorted structure, and insertion and search on a list are not logarithmic.
The for loop iterates over the input n times. For each iteration, shelf.insert(0, int(book)) shifts every existing element and therefore costs O(n), while shelf.append(int(book)) is amortized O(1).
The shelf.index(book_id) operation performs a linear scan and also takes O(n) in the worst case.
Finally, the min(left_pops, right_pops) call takes constant time.
Overall, since each of the n instructions can trigger an O(n) insert or index operation, the worst-case time complexity of the snippet is O(n²), not O(n log n).
Using a doubly linked list (e.g. collections.deque), insertions at both ends become O(1), but locating a book by its ID still requires a traversal. Storing each book's logical position in a dictionary removes the search entirely, so every instruction can be handled in O(1) and the whole input in O(n) total time.
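The O(n)-total approach can be sketched without any list traversal at all: store each book's logical position in a dictionary and only move two boundary counters on inserts. A sketch (the (instruction, id) tuple input format is an assumption for illustration):

```python
def shelf_queries(instructions):
    """Answer each '?' query in O(1): pos maps book id -> logical position;
    occupied positions lie in the interval [lo + 1, hi - 1]."""
    pos, lo, hi = {}, -1, 1
    answers = []
    for ins, book in instructions:
        if ins == 'L':
            pos[book] = lo   # place just beyond the current left boundary
            lo -= 1
        elif ins == 'R':
            pos[book] = hi   # place just beyond the current right boundary
            hi += 1
        else:  # '?': books to the left vs. books to the right of this one
            answers.append(min(pos[book] - (lo + 1), (hi - 1) - pos[book]))
    return answers

sample = [('L', 75), ('R', 20), ('R', 30), ('L', 11),
          ('?', 75), ('L', 12), ('L', 15), ('?', 20)]
print(shelf_queries(sample))  # → [1, 1]
```

This reproduces the sample answers above while doing constant work per instruction.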